ABSTRACT

A barrier is a method for synchronizing a large number of concurrent
computer processes. After considering some basic synchronization
mechanisms, a collection of barrier algorithms with either linear or
logarithmic depth will be presented. A graphical model is described
that profiles the execution of the barriers and other parallel programming
constructs. This model shows how the interaction between the
barrier algorithms and the work that they synchronize can impact their
performance. One result is that logarithmic tree structured barriers
show good performance when synchronizing fixed length work, while
linear self-scheduled barriers show better performance when synchronizing
fixed length work with an embedded critical section. The linear barriers
are better able to exploit the process skew associated with critical
sections. Timing experiments, performed on an eighteen processor
Flex/32 shared memory multiprocessor, that support these conclusions
are detailed.
1. Introduction

A  barrier  is  a  method  for  synchronizing  a  large  number  of  concurrent  computer 
processes.  It  is  a  convenient  programming  tool  if  the  completion  of  one  part  of  a 
parallel  program  is  required  before  any  processes  may  begin  execution  of  the  next  part. 
This  paper  will  develop  and  consider  the  relative  performance  of  a  variety  of  different 
barrier  algorithms.  The  performance  of  the  barrier  algorithms  will  be  modeled  in 
terms  of  a  shared  memory  multiprocessor.  Interestingly  enough,  the  interaction  of  the 
barrier algorithms with the arrival behavior and departure requirements of the
various processes may impact their performance drastically. We will see that
different  barrier  implementations  can  deliver  the  best  performance  under  differing  run 
time  conditions.  Actual  timing  data  will  be  considered. 

In  an  attempt  to  provide  for  a  fair  comparison,  all  the  barrier  algorithms 
presented  will  have  the  following  points  in  common.  All  algorithms  will  reinitialize 
themselves  during  use,  so  that  the  barriers  may  be  used  repetitively  in  loops,  etc. 
None  of  the  barriers  will  have  execution  times,  synchronizational  complexity  or  data 
requirements  greater  than  proportional  to  np,  where  np  is  the  number  of  processes 
participating  in  the  barrier.  The  barrier  algorithms  all  function  correctly,  even  if  one 
or  more  of  the  participating  processes  are  suspended  for  a  finite  length  of  time  at  any 
point during execution. Finally, each barrier allows for a sequential code block to be
executed by a single process; that is to say, when executing a barrier all processes will
synchronize (the entry phase), then one process will execute the sequential part, then
all processes will be released (the signal phase).

The  barrier  is  a  control  oriented  synchronization.  If  one  knows  that  all  processes 
have  arrived  at  a  particular  point,  then  indirectly  one  knows  that  all  data  references  of 
the  previous  parallel  computation  have  been  completed.  It  structures  a  program  into  a 
sequence  of  parallel  computations.  Successive  barriers  approximate  the  well  known 
Fork  -  Join  concept.  A  barrier  is  similar  in  nature  to  a  Join  followed  by  a  Fork, 
except  that  the  number  of  processes  necessarily  remains  fixed  across  a  barrier,  and 
each process preserves its private memory state across a barrier. In a situation
analogous to the code block allowed between a Join and a Fork, the barrier algorithms
presented in this paper allow an optional sequential code block between the parallel
execution phases that are being synchronized.

The barrier was first proposed in 1978 as a hardware feature for the Finite
Element Machine being developed at the time at NASA Langley [1]. Since then the
barrier has evolved into a widely used parallel control construct. Barriers are included in
several programming paradigms, including software planned for IBM's RP3 project
[2], Lusk and Overbeek's monitors [3], Jordan's programming language Force [4], and
the synchronization primitives described by Frederickson, Jones and Smith [5].
Sequent Computer Systems includes a barrier implementation with the parallel
programming library supplied with their Balance 8000/21000 systems [6]. Although
barriers are widely used and the semantics are well understood, barrier implementations
have seen little attention in the literature, with Axelrod's analysis of the butterfly
barrier in [7] being a notable exception. This paper proposes to investigate high
performance barrier algorithms in great detail and to present some new barrier concepts,
such as an alternating polarity, simple tree structured communications, and
elimination of the need for atomic read-write cycles in shared memory. These algorithms will
be compared to existing algorithms when appropriate.

2.  Argument  for  correctness  of  the  algorithms 

The  barrier  algorithms  presented  below  will  be  directed  towards  a  shared  memory 
MIMD  multiprocessor.  The  barriers  require  both  shared  variables  which  can  be 
accessed  by  all  processes,  and  private  variables  which  are  unique  for  each  process.  For 
example, data structures used to implement synchronization must be kept in shared
memory; np, the number of processes, could be stored as either a shared or private
variable  (since  its  value  should  not  change);  while  the  process  id,  a  unique  integer 
between  one  and  np  that  identifies  each  process,  must  be  stored  in  a  private  location. 

In  the  process  of  developing  the  algorithms,  three  basic  concepts  appeared.  First, 
some  algorithms  require  only  boolean  variables  in  shared  memory,  while  others  use 
spinlock routines that provide for atomic read-write access to shared memory
locations. Second, the communication pattern of the entry phase may be either linearly or
tree structured. And finally, barriers may have symmetric entry and signal phases, or
the signal phase may be implemented as a broadcast identifying a reversal of polarity.

The  hardware  requirements  for  these  algorithms  are  quite  minimal.  Only  the 
availability  of  shared  and  private  memory  is  required  by  half  of  the  algorithms.  The 
other algorithms require spinlock routines in addition. A spinlock is a software
construct which provides for mutual exclusion. Spinlocks typically are based upon a
hardware machine instruction allowing an atomic read-write cycle when accessing
shared  memory.  A  spinlock  atomically  performs  the  following  two  actions:  it  waits 
(spins)  until  its  argument  (called  a  lock)  is  clear,  and  then  it  sets  its  argument.  Unlock 
unconditionally  clears  (unlocks)  its  argument. 

2.1. Synchronizing shared variables vs. locks

It  should  be  reiterated  that  attention  will  be  restricted  to  synchronization  that 
will  work  repetitively.  Consider  three  different  ways  of  synchronizing  two  processes 
as shown in Figure 2-1. These two process synchronization "mechanisms" can be used
as building blocks to achieve larger barrier algorithms. The dimensions specified for
the data structures in Figure 2-1 are those required by typical algorithms when
synchronizing np processes. The first algorithm relies only upon shared variables. The
state  transitions  of  the  shared  boolean  variables  are  used  as  signals  between  different 
processes.  This  algorithm  will  be  contrasted  with  others  that  use  spinlock  routines. 

The first algorithm, (1), works because only two processes access FLAG(i), and each
can cause only one of the two state transitions. First, in order to indicate to
the master that it has arrived, SLAVE(i) causes the negative transition of FLAG(i).
Once the master has verified that this negative transition has occurred, the master
responds by sending the slave an exit signal, the positive transition of FLAG(i). The
slave must wait for this positive transition to occur before it is released from the
barrier. There are two wait states, each of which is similar in function to a spinlock.
This algorithm splits neatly into entry and signal phases, and the master may execute
a sequential code block between its wait and set instructions.
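
The handshake of mechanism (1) can be sketched in a few lines. The sketch below is an illustration only, not the code of Appendix A: Python threads stand in for processes, a one-element list stands in for the shared boolean FLAG(i), and it is assumed that reads and writes of that element become visible to the other thread in program order (which holds under CPython's interpreter lock). The names handshake_demo and log are hypothetical.

```python
import threading
import time


def handshake_demo(rounds=50):
    """Run mechanism (1) `rounds` times between one master and one slave thread."""
    flag = [True]   # shared FLAG(i); True is the initial (idle) state
    log = []        # shared event log; list.append is atomic in CPython

    def master():
        for r in range(rounds):
            while flag[0]:            # wait until FLAG(i) is false
                time.sleep(0)         # yield so the other thread can run
            log.append(("seq", r))    # optional sequential code block
            flag[0] = True            # set FLAG(i) true: the exit signal

    def slave():
        for r in range(rounds):
            log.append(("arrive", r))
            flag[0] = False           # set FLAG(i) false: the entry signal
            while not flag[0]:        # wait until FLAG(i) is true
                time.sleep(0)
            log.append(("leave", r))

    threads = [threading.Thread(target=master), threading.Thread(target=slave)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because each thread causes only one of the two transitions, the log should always read arrive, seq, leave for every round, even though no atomic read-write instruction is used.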

A similar synchronization, (2), can be written using spinlocks. Algorithm (2b) is
semantically equivalent to (2a), but (2b) expands the spinlock routines into
their logical components, allowing for an easier comparison with (1). The symbol &
in Figure 2-1 unites the two logically distinct operations of a spinlock. The slave
unlocks a lock that impedes the progress of the master, and then vice versa.
Algorithm (2) is roughly the same as (1), except that the two possible waiting states are
implemented with spinlocks, each of which requires a lock; so separate entry and
signal data structures must be used, resulting in twice the data requirement of (1).
Algorithm (2) does not require the spinlocks to be implemented with atomic read-write
cycles.
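
Mechanism (2a) can be sketched in the same way. This is a hedged illustration under stated assumptions: Python's threading.Lock stands in for the spinlock and unlock routines (its acquire blocks rather than spins, and it may be released by a thread other than the one that acquired it), and the names lock_handshake_demo, entry and exit_ are hypothetical.

```python
import threading


def lock_handshake_demo(rounds=50):
    """Run mechanism (2a) `rounds` times using two locks, ENTRY(i) and EXIT(i)."""
    entry = threading.Lock()
    exit_ = threading.Lock()
    entry.acquire()                   # ENTRY(i) is initialized to locked
    exit_.acquire()                   # EXIT(i) is initialized to locked
    log = []

    def master():
        for r in range(rounds):
            entry.acquire()           # spinlock(ENTRY(i)): wait for the slave
            log.append(("seq", r))    # optional sequential code block
            exit_.release()           # unlock(EXIT(i)): release the slave

    def slave():
        for r in range(rounds):
            log.append(("arrive", r))
            entry.release()           # unlock(ENTRY(i)): signal arrival
            exit_.acquire()           # spinlock(EXIT(i)): wait for the exit signal
            log.append(("leave", r))

    threads = [threading.Thread(target=master), threading.Thread(target=slave)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

After every round both locks are locked again, which is the initial state, so the mechanism can be repeated without any separate reinitialization.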

(1)   Data required:  Shared boolean FLAG(2..np)
      Initialize:     FLAG(*) = true

      MASTER                              SLAVE(i)
      wait until FLAG(i) is false         set FLAG(i) false
      set FLAG(i) true                    wait until FLAG(i) is true


(2a)  Data required:  Shared boolean ENTRY(2..np), EXIT(2..np)
      Initialize:     ENTRY(*) = locked, EXIT(*) = locked

      MASTER                              SLAVE(i)
      spinlock(ENTRY(i))                  unlock(ENTRY(i))
      unlock(EXIT(i))                     spinlock(EXIT(i))


(2b)  Data required:  Shared boolean ENTRY(2..np), EXIT(2..np)
      Initialize:     ENTRY(*) = true, EXIT(*) = true

      MASTER                              SLAVE(i)
      wait until ENTRY(i) is false        set ENTRY(i) false
        & set ENTRY(i) true               wait until EXIT(i) is false
      set EXIT(i) false                     & set EXIT(i) true


(3a)  Data required:  Locks SYNC(1..log2(np), 1..np)
      Initialize:     SYNC(*,*) = locked

      PARTNER(i)                          PARTNER(j)
      spin-unlock(SYNC(lev,j))            spin-unlock(SYNC(lev,i))
      spinlock(SYNC(lev,i))               spinlock(SYNC(lev,j))


(3b)  Data required:  Shared boolean SYNC(1..log2(np), 1..np)
      Initialize:     SYNC(*,*) = true

      PARTNER(i)                          PARTNER(j)
      wait until SYNC(lev,j) is true      wait until SYNC(lev,i) is true
        & set SYNC(lev,j) false             & set SYNC(lev,i) false
      wait until SYNC(lev,i) is false     wait until SYNC(lev,j) is false
        & set SYNC(lev,i) true              & set SYNC(lev,j) true


Figure 2-1 Three basic two process synchronization mechanisms


Another two process synchronization mechanism, one that has been suggested for
use with the proposed butterfly barrier [7], is shown in (3). Unlike the other two
algorithms, (3) is entirely symmetric with respect to the pair of processes that it
synchronizes, which is why the processes are denoted as partners, rather than as master and
slave. Again, (3a) is coded using spinlock and unlock routines; (3b) is the semantic
equivalent. At first glance, algorithm (3) appears to be superior to (2) because the two
processes may proceed in parallel. However, the price to be paid for this is that there
is no way to incorporate a sequential code block into this algorithm. More
importantly, this algorithm is incorrect for repetitive barriers if it is implemented with an
unconditional unlock routine. For example, if PARTNER(i) is suspended after
executing its unlock instruction, but PARTNER(j) continues execution and reaches another
barrier, then PARTNER(j) may unlock PARTNER(i) a second time before PARTNER(i)
has had a chance to lock it. If this occurs, then the barrier fails and some processes
will deadlock on the final barrier to be executed. This problem may be corrected if the
unlock is replaced with a spin-unlock. A spin-unlock would wait until its argument is
locked, and then unlock it. With this understanding, we see that (3) has no advantage
over (2), since each now has a depth of two waiting statements. Once again, the
read-write atomicity of the spinlock routine is unnecessary for this application.
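
The spin-unlock correction can be illustrated with a small sketch of the symmetric mechanism (3b). It is an illustration under assumptions, not the author's code: Python threads stand in for the two partners, a two-element list stands in for SYNC(lev,i) and SYNC(lev,j) at a single level, and the names spin_lock, spin_unlock and partner_demo are hypothetical. Each element is cleared by only one partner and set by only the other, so plain reads and writes are used.

```python
import threading
import time


def spin_lock(sync, k):
    while sync[k]:            # wait until the lock is clear ...
        time.sleep(0)
    sync[k] = True            # ... and then set it


def spin_unlock(sync, k):
    while not sync[k]:        # wait until the lock is set ...
        time.sleep(0)
    sync[k] = False           # ... and then clear it


def partner_demo(rounds=50):
    """Synchronize two partners `rounds` times; return any violations observed."""
    sync = [True, True]       # SYNC(lev,i) and SYNC(lev,j), both initially locked
    arrived = [0, 0]          # arrived[k] = latest round partner k has reached
    violations = []

    def partner(me, other):
        for r in range(1, rounds + 1):
            arrived[me] = r
            spin_unlock(sync, other)     # release the partner
            spin_lock(sync, me)          # wait to be released by the partner
            if arrived[other] < r:       # partner had not reached this round
                violations.append((me, r))

    threads = [threading.Thread(target=partner, args=(0, 1)),
               threading.Thread(target=partner, args=(1, 0))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

If spin_unlock were replaced by an unconditional clear, a fast partner could clear the same lock twice before the slow partner had set it once, and the second signal would be lost; the wait inside spin_unlock is what prevents this.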

Comparing  (1),  which  is  a  master-slave  algorithm,  with  (3),  which  is  symmetric, 
one  can  see  that  (3)  has  the  same  depth  as  (1),  and  exactly  twice  the  computation. 
Algorithm (3) is like a rearranged version of (1) united with its mirror image (master
and slave roles reversed) with both halves running in parallel. In this fashion (3)
becomes a symmetric version of (1). However, algorithm (1) is simpler and sufficient to
synchronize  two  processes.  Only  synchronization  mechanisms  (1)  and  (2)  will  be  used 
to  develop  larger  barriers  in  the  sections  below. 

2.2.  Symmetric  structure  vs.  broadcast  exit  polarity 

We  have  seen  that  the  algorithms  presented  so  far  have  distinct  entry  and  signal 
phases.  The  signal  phase  of  a  barrier  may  be  implemented  as  the  symmetric  image  of 
the  entry  phase,  with  the  exit  signal  propagating  out  in  the  same  fashion  as  the  entry 
signal  was  communicated.  Or  one  may  implement  the  signal  part  using  the  polarity 
exit mechanism presented below. Assigning a polarity to barrier iterations allows
barriers with a single data set to be executed repetitively, without requiring a signal phase
analogous  to  the  entry  phase. 

Earlier it was postulated that if a shared boolean synchronizing variable was used
by only two processes, and each could initiate only one of the two possible state
changes, then no atomic read-write access would be necessary to ensure correct
synchronization. The exit phase of a barrier requires a single process to essentially
broadcast a signal to all others indicating that they may exit. Perhaps it is possible for a
single separate shared data element to convey the exit information. Only one process
should be able to change the state of this exit variable, while all others would have
only a read capability.

Some  problems  come  to  mind  immediately.  The  exit  phase  of  the  barrier  may 
need  to  serve  to  reinitialize  the  barrier  so  that  it  will  function  properly  on  its  next 
iteration.  Is  it  possible  to  code  an  algorithm,  one  that  uses  a  single  exit  data  variable, 
that  will  also  correctly  reinitialize  itself  for  future  iterations?  The  answer  is  yes,  but  it 
requires  an  increase  in  the  complexity  of  the  algorithms.  The  introduction  of  a  private 
boolean  variable  indicating  the  polarity  of  the  current  barrier  iteration  is  one  way  to 
handle  the  reinitialization  problem. 

Successive  barriers  will  alternate  polarities.  All  processes  will  share  the  same 
polarity on a given iteration of a barrier, defining the polarity for that barrier
iteration. The polarity is a private variable for each process, just as the process id is
private.  Processes  will  compare  their  polarity  with  a  shared  exit  variable.  Note,  at  a 
given  point  in  time,  one  process  could  be  entering  a  barrier  with  one  polarity,  while 
another  process  was  still  exiting  the  previous  barrier  of  opposite  polarity. 

A  basic  broadcast  exit  signal  mechanism  is  shown  in  Figure  2-2.  The  barrier  exit 
signal  is  the  change  of  state  of  the  shared  exit  variable.  One  can  see  that  the  slaves 
will be able to correctly differentiate between successive exit signals, since the
following barrier will have the opposite polarity, and all slaves will be inhibited until the
master sends the next exit signal. This exit mechanism provides for a nearly
simultaneous release of concurrent processes from a barrier, limited only by how the specific
machine  architecture  handles  concurrent  reads  of  a  single  shared  memory  location. 

      Data requirement:  Private boolean polarity
                         Shared boolean EXIT
      Initialize:        polarity = true, EXIT = false

      MASTER                          SLAVES
(4)   EXIT := polarity                wait until EXIT = polarity
      polarity := not polarity        polarity := not polarity

Figure 2-2 Broadcast exit signal mechanism


However, this synchronization mechanism does not give the master any
information as to when and if all the slaves have received the exit signal. Any time a signal is
sent, there must be a two way flow of information to let the sender know that the
signal has been received. If the master issues the next exit signal too soon, before all
slave processes have quit waiting on the previous exit state change, then the barrier
would be incorrect. This issue is solved if the broadcast exit mechanism is interleaved
with a correct entry algorithm. In this fashion, the master would issue the exit signal
only when assured that all processes have entered the current barrier iteration. Thus,
indirectly, the master knows that all processes have "seen" the previous exit signal.

The  addition  of  an  alternating  polarity  to  the  barrier  is  compatible  with  the 
semantic  barrier  concept,  precisely  since  all  processes  are  required  to  attend  to  each 
iteration  of  a  barrier.  Thus  if  all  processes  are  initialized  to  the  same  polarity,  then  we 
see  that  it  is  impossible  for  processes  to  get  their  polarities  out  of  sync,  no  matter  how 
the  barriers  are  distributed  in  program  code.  Thus,  a  single  master  can  send  many 
slaves  an  exit  signal. 

An algorithm somewhat along these lines has been described in [4]. That
algorithm uses a system call to signal an event, and the operating system ensures that all
the slave processes waiting for the event to occur do in fact receive it.


      Data requirement:  Private boolean polarity
                         Shared boolean ENTRY(2..np), EXIT
      Initialize:        polarity = true, ENTRY(*) = false, EXIT = false

      MASTER                                  SLAVES(i)
(5)   for each i do                           ENTRY(i) := polarity
         wait until ENTRY(i) = polarity
      endfor                                  wait until EXIT = polarity
      EXIT := polarity
      polarity := not polarity                polarity := not polarity

Figure 2-3 Linear barrier with broadcast exit


Finally, the algorithms that use synchronization mechanism (1) require the signal
phase to reinitialize the data structure for the entry phase. With the use of the
broadcast exit phase, the entry phases will need to reinitialize themselves. It turns out that
the same polarity state can be used to modify (1) in order to achieve this end. A
resulting linear algorithm, (5), is presented in Figure 2-3. Let the process with id=1
be designated the master, while all other processes are slaves. The slaves are indexed
from 2 to np, with the slave index being equivalent to the process id. In (5) the
master receives signals indicating that all slaves have changed the state of their entry
variables. Then and only then does the master broadcast its change of state. Thus,
proper synchronization is ensured. A tree structured version of this algorithm with
logarithmic depth will be developed in the next section.
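
Algorithm (5) can be sketched as follows. This is a minimal illustration under assumptions, not the bbrlin routine of Appendix A: Python threads stand in for processes, list elements stand in for the shared booleans, each thread keeps its private polarity in a local variable, and the class and function names are hypothetical. The small driver run checks the barrier property, namely that no process leaves round r before every process has reached round r.

```python
import threading
import time


class LinearBroadcastBarrier:
    """Linear entry phase with broadcast exit, after algorithm (5)."""

    def __init__(self, nproc):
        self.nproc = nproc                  # the paper's np
        self.entry = [False] * (nproc + 1)  # ENTRY(2..np); slots 0 and 1 unused
        self.exit = False                   # shared EXIT

    def wait(self, pid, polarity, sequential=None):
        """Synchronize process pid (1..np); return the polarity for the next barrier."""
        if pid == 1:                                  # master
            for i in range(2, self.nproc + 1):
                while self.entry[i] != polarity:      # wait until ENTRY(i) = polarity
                    time.sleep(0)
            if sequential is not None:
                sequential()                          # optional sequential code block
            self.exit = polarity                      # broadcast the exit signal
        else:                                         # slave
            self.entry[pid] = polarity                # entry signal
            while self.exit != polarity:              # wait until EXIT = polarity
                time.sleep(0)
        return not polarity


def run(nproc, rounds):
    """Drive nproc threads through `rounds` barriers; return observed violations."""
    barrier = LinearBroadcastBarrier(nproc)
    progress = [0] * (nproc + 1)     # progress[pid] = latest round pid has reached
    violations = []

    def worker(pid):
        polarity = True              # private polarity, same initial value everywhere
        for r in range(1, rounds + 1):
            progress[pid] = r
            polarity = barrier.wait(pid, polarity)
            if min(progress[1:]) < r:        # some process had not arrived yet
                violations.append((pid, r))

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(1, nproc + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

Note that neither ENTRY nor EXIT is ever reset explicitly: the value that signals arrival in one iteration is the idle value of the next, which is how the polarity makes the barrier self-reinitializing.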

2.3.  Tree  structure  vs.  linear  structure 

The  next  idea  to  be  considered  is  what  type  of  communication  pattern  to  employ 
within  the  entry  phase  of  the  barrier  algorithm.  Either  a  linear  or  tree  structured 
approach may be used. A linear approach tends to be simpler since it typically
requires fewer overhead calculations. However, a tree structured approach has
logarithmic depth. In order to develop a tree structured barrier, two arrangements of the
linear algorithm will first be considered. Consider the graphical representation in
Figure 2-4 of the same algorithm outlined in Figure 2-3. In Figure 2-4, processes are
represented  by  vertical  lines,  and  time  flows  downward. 

If  we  think  of  the  basic  two  process  synchronization  mechanism  in  terms  of  its 
entry  and  signal  components,  then  we  see  that  the  algorithm  in  Figure  2-4  works  by 
having  a  single  master  accept  all  the  entry  signals,  then  executing  the  sequential  part, 
and  then  issuing  the  exit  signal.  The  ordering  of  the  acceptance  of  the  entry  signals  is 
arbitrary,  but  practical  implementations  will  require  a  pre-scheduled,  fixed  order. 
Usually  a  Fortran  style  do  loop  is  employed  for  this  purpose.  Np  boolean  shared 
memory  variables  are  required.  This  algorithm  is  coded  in  an  extended  Fortran  as 
bbrlin (linear broadcast barrier) in Appendix A. Also in the Appendix, a similar
barrier, with symmetric entry and exit phases instead of a broadcast exit, is coded as
barlin.

There is an alternate linear design, as shown in Figure 2-5. Instead of having a
single process accept all the entry signals from its slaves, this design has each process
accept the entry signal from its next higher numbered neighbor, and then issue its
entry signal to the next lower numbered neighbor. In this manner, the entry signals
will propagate down to the lowest numbered process. This modified linear algorithm
requires a fixed ordering in its communication pattern. The processes numbered 1 and
np are special cases, so this algorithm requires additional branching if a single
program guides all processes. Thus, this algorithm is less efficient than the previous one.

However, there is a point to be made here. Using only the two process
synchronization mechanisms developed earlier, both of these linear algorithms successfully
synchronize many processes. It is possible for a single process to synchronize with several
neighbors, and it is also possible for processes to propagate several signals onward to
others. The point here is that if one accepts the validity of both of these algorithms,
then it is a simple matter to postulate the existence of a binary (or other dimensional)
tree structured algorithm. For the binary tree, at each node a process would accept
the entry signal from one neighbor, and then it would propagate this signal along with
its own presence to the next lower level of the tree. A binary tree structured algorithm
is presented graphically in Figure 2-6, and coded as bbrtre (broadcast tree) in
Appendix A. In Figures 2-6 and 2-7, array subscripts are shown in order to identify the
shared variable being operated on at each point in the tree.

This is a powerful algorithm. The depth is only 1 + ⌈log2(np)⌉. As with the
linear algorithms, only np boolean shared memory data elements are required. This
barrier is completely self resetting and airtight, in the sense that if one or more
processes are suspended during execution, the barrier is delayed but otherwise
continues to function correctly.
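
A binary tree entry phase with broadcast exit can be sketched as below. This is not the bbrtre routine of Appendix A, which is not reproduced here; the pairing rule is an assumption chosen for illustration. With rank = id - 1, at the stage with stride s a process whose rank is a multiple of 2s waits for the process at rank + s (if that process exists), and any other process signals its arrival and drops out of the entry phase. Python threads stand in for processes and all names are hypothetical.

```python
import threading
import time


class TreeBroadcastBarrier:
    """Binary tree entry phase with broadcast exit (pairing rule is an assumption)."""

    def __init__(self, nproc):
        self.nproc = nproc                 # the paper's np
        self.entry = [False] * nproc       # one entry flag per process rank
        self.exit = False                  # shared EXIT

    def wait(self, pid, polarity, sequential=None):
        """Synchronize process pid (1..np); return the polarity for the next barrier."""
        rank = pid - 1
        stride = 1
        while stride < self.nproc:                    # ceil(log2(np)) stages at most
            if rank % (2 * stride) == 0:
                partner = rank + stride
                if partner < self.nproc:              # partner may not exist if np
                    while self.entry[partner] != polarity:   # is not a power of two
                        time.sleep(0)
            else:
                self.entry[rank] = polarity           # pass own subtree's arrival down
                break
            stride *= 2
        if rank == 0:                                 # root of the tree acts as master
            if sequential is not None:
                sequential()                          # optional sequential code block
            self.exit = polarity                      # broadcast the exit signal
        else:
            while self.exit != polarity:
                time.sleep(0)
        return not polarity


def run(nproc, rounds):
    """Drive nproc threads through `rounds` barriers; return observed violations."""
    barrier = TreeBroadcastBarrier(nproc)
    progress = [0] * (nproc + 1)     # progress[pid] = latest round pid has reached
    violations = []

    def worker(pid):
        polarity = True
        for r in range(1, rounds + 1):
            progress[pid] = r
            polarity = barrier.wait(pid, polarity)
            if min(progress[1:]) < r:        # some process had not arrived yet
                violations.append((pid, r))

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(1, nproc + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

The root passes through at most ⌈log2(np)⌉ waiting stages and then issues one broadcast, which agrees with the depth quoted above. Each entry flag is written by one process and read by one other, so the two process argument made for mechanism (1) carries over to every edge of the tree.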

Figure 2-4 Linear barrier graph (nested entry structure)

Figure 2-6 Binary tree barrier with broadcast exit


While the depth is logarithmic, each stage of a tree barrier requires more
computation than for a linear barrier, since not only must the synchronization be
accomplished, but each process must first calculate with which neighbor to synchronize.
This calculation is further complicated if np is not constrained to be a power of two.
The algorithm shown in Appendix A dynamically calculates with which neighbors to
synchronize. It requires about five primitive integer operations (shift, compare, add)
in order to calculate the neighbor's id at each stage of the entry phase. Surprisingly, a
version which precalculates these ids and then stores them in a private array requires
nearly as many integer operations to fetch the numbers from the array. However,
some run time advantage would probably be achieved using the precalculated
approach, at the cost of an additional data structure.

Instead  of  using  the  broadcast  exit  mechanism,  it  is  possible  to  have  a  double  tree 
structured  barrier,  with  the  symmetric  entry  and  signal  phases.  A  graphical  version  of 
this  barrier  is  shown  in  Figure  2-7.  For  a  hard  coded  example  of  this  algorithm,  see 
bartre in Appendix A. The depth is now increased to 2 log2(np), but we can do away
with  the  polarity  concept,  simplifying  the  environment  somewhat,  and  also  (np-1) 
processes  are  not  all  competing  to  read  a  single  exit  variable  at  once.  The  same  shared 
boolean  data  structure,  an  array  indexed  from  2  to  np,  is  used  by  both  the  entry  and 
signal  parts. 

If one wanted to use spinlocks, then it is still possible to employ a tree structured
algorithm. Spinlock routines will usually be more expensive, but they may provide
superior performance on certain types of hardware (hardware interrupt driven lock
tables, for example). Barrier algorithms coded with spinlock routines require separate
data structures to be used for the entry and signal parts of the barrier, unless a
broadcast exit signal is used. A tree barrier with broadcast exit and a double tree barrier,
both  using  locks,  are  coded  as  bbrtrl  and  bartrl,  respectively,  in  Appendix  A.  These 
are  some  barrier  algorithms  that  use  two  process  synchronization  mechanisms  as  their 
building  blocks. 


Figure  2-7  Double  tree  barrier 


2.4.  Linear  barriers  with  critical  sections 

Another  quite  different  linear  approach  is  possible,  one  that  has  traditionally  been 
employed.  In  this  algorithm,  a  critical  section  using  either  an  entry  or  exit  lock  is  used 
to  protect  a  shared  counter  variable.  Processes  count  in  (except  the  last),  and  then 
spin  on  the  exit  lock  until  the  counter  is  equal  to  np,  at  which  point  they  count  out 
(except the last). When all processes have counted out, the entry lock is reset,
allowing the processes to reenter on the next barrier iteration. Thus, the counter
variable swings between one and np, and is either (under protection of the entry or
exit lock) monotonically increasing or monotonically decreasing, until an endpoint is
reached, at which point it is reversed. Careful coding allows the use of only two locks,
and each process (except the last) requires two accesses into a critical section per
barrier iteration. Processes will be skewed in time somewhat as they go through, one
by  one,  the  entry  and  exit  critical  regions.  The  average  depth  of  this  algorithm  is  only 
np,  not  2np,  since  the  entry  and  exit  phases  may  effectively  be  overlapped,  if  there  is 
sufficient  work  between  iterations  of  the  barrier. 

This algorithm is shown in Figure 2-8. The pseudo code shown below (Figures 2-8
& 2-9) is to be executed by all processes participating in the barrier. A modified
version of this algorithm is coded as barlok in Appendix A. Barlok has been modified to
allow a sequential part within the barrier, and to ensure that the sequential part will
always be executed by the same process. On a side note, it may be desirable to ensure
that the same process always executes the sequential part of a barrier. For example, if
there are private variables used within the sequential part on successive barrier
iterations, then allowing different processes to execute the sequential parts may introduce
unwanted non-determinism into a program's execution. Also, the sequential part of a
barrier is often used for file i/o; and on some machine architectures file i/o is
simplified if the same process always does the i/o.


      Data requirement:  Locks ENTRY, EXIT
                         Shared integer COUNTER
      Initialize:        ENTRY = unlocked, EXIT = locked,
                         COUNTER = 1

      ALL PROCESSES

      spinlock(ENTRY)
      if (COUNTER < np) then
         COUNTER := COUNTER + 1
         unlock(ENTRY)
         spinlock(EXIT)
      endif
      if (COUNTER = 1) then
         unlock(ENTRY)
      else
         COUNTER := COUNTER - 1
         unlock(EXIT)
      endif


Figure 2-8 Two lock barrier algorithm
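
The two lock barrier of Figure 2-8 can be transcribed almost line for line. The sketch below is an illustration under assumptions rather than the barlok routine: Python's threading.Lock stands in for the spinlock and unlock routines (it blocks instead of spinning, and may be released by a thread other than the one that acquired it), and the class and driver names are hypothetical.

```python
import threading


class TwoLockBarrier:
    """Counter barrier protected by ENTRY and EXIT locks, after Figure 2-8."""

    def __init__(self, nproc):
        self.nproc = nproc                 # the paper's np
        self.entry = threading.Lock()      # ENTRY starts unlocked
        self.exit = threading.Lock()
        self.exit.acquire()                # EXIT starts locked
        self.counter = 1

    def wait(self):
        self.entry.acquire()               # spinlock(ENTRY)
        if self.counter < self.nproc:      # not the last process to arrive
            self.counter += 1              # count in
            self.entry.release()
            self.exit.acquire()            # wait for the last arrival
        # The last arrival falls through here still holding ENTRY.
        if self.counter == 1:              # last process to leave
            self.entry.release()           # reopen the barrier for the next iteration
        else:
            self.counter -= 1              # count out
            self.exit.release()            # pass the exit on to the next process


def run(nproc, rounds):
    """Drive nproc threads through `rounds` barriers; return observed violations."""
    barrier = TwoLockBarrier(nproc)
    progress = [0] * nproc           # progress[k] = latest round thread k has reached
    violations = []

    def worker(k):
        for r in range(1, rounds + 1):
            progress[k] = r
            barrier.wait()
            if min(progress) < r:            # some process had not arrived yet
                violations.append((k, r))

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(nproc)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

At the end of each iteration ENTRY is unlocked, EXIT is locked and the counter is back at one, so a fast process that returns to the barrier early is held at ENTRY until the slowest process has counted out.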


A  version  of  the  two  lock  barrier  that  incorporates  the  broadcast  exit/polarity 
mechanism  is  given  in  Figure  2-9.  Only  one  lock  for  the  single  critical  section  is 
required.  Under  protection  of  the  critical  section,  processes  decrement  the  shared 
counter.  The  last  process  to  decrement  the  counter  then  assumes  the  role  of  master 
and issues the exit signal to all the other processes. Once again, a modified version of
this algorithm is coded in Appendix A as bbrlok. The modified version allows a
sequential part and ensures that the process with id=1 will always be the one to
execute the sequential part.

It should be noted that fetch and add [8] hardware can eliminate the need for critical
sections entirely and reduce these algorithms to logarithmic depth. Tang and Yew
outline a barrier algorithm incorporating the use of fetch and add through Cedar
primitives [9]. However, that implementation requires subsequent barrier iterations to
use different data sets in order to guarantee correct execution free of race conditions.

Data requirement:  Private boolean polarity
                   Private integer mycount
                   Shared boolean EXIT
                   Shared integer COUNTER
                   Locks ENTRY

Initialize:        polarity = true, EXIT = false,
                   COUNTER = np, ENTRY = unlocked

ALL PROCESSES

    spinlock(ENTRY)
    mycount := COUNTER - 1
    COUNTER := mycount
    unlock(ENTRY)

    if (mycount = 0) then
        COUNTER := np
        EXIT := polarity
    else
        wait until EXIT = polarity
    endif

    polarity := not polarity


Figure 2-9  Single lock barrier with broadcast exit
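
As a hedged illustration of the single lock barrier with broadcast exit, the following Python sketch keeps each thread's private polarity in thread-local storage. The names `BroadcastBarrier`, `wait`, and `run_broadcast_demo` are assumptions made for this example, and `time.sleep(0)` merely yields the interpreter where real hardware would spin on the shared EXIT flag.

```python
import threading
import time

class BroadcastBarrier:
    """Sketch of the single lock, broadcast exit barrier; names are illustrative."""

    def __init__(self, np):
        self.np = np
        self.counter = np
        self.exit_flag = False             # shared EXIT
        self.entry = threading.Lock()      # ENTRY
        self._local = threading.local()    # holds each thread's private polarity

    def wait(self):
        polarity = getattr(self._local, "polarity", True)
        with self.entry:                   # the single critical section
            mycount = self.counter - 1
            self.counter = mycount
        if mycount == 0:
            self.counter = self.np         # reset before releasing anyone
            self.exit_flag = polarity      # broadcast the exit signal
        else:
            while self.exit_flag != polarity:
                time.sleep(0)              # yield; stands in for a busy wait
        self._local.polarity = not polarity


def run_broadcast_demo(np=4, iterations=50):
    """Return True if every thread saw all others arrive before it left."""
    barrier = BroadcastBarrier(np)
    phase = [0] * np
    ok = [True]

    def worker(i):
        for k in range(iterations):
            phase[i] = k + 1
            barrier.wait()
            if min(phase) < k + 1:
                ok[0] = False

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(np)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ok[0]
```

Flipping the polarity on every iteration is what lets the same EXIT flag be reused without a separate reset phase.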


3.  A  graphical  run  time  parallel  execution  model 

Why  develop  so  many  different  barrier  algorithms,  when  they  all  achieve  the  same 
function?  Obtaining  the  best  run  time  execution  speed  usually  is  the  primary  concern. 
In  this  section,  a  graphical  model  will  be  used  to  investigate  the  run  time  performance 
of  the  barriers.  Barriers  tend  to  maximize  the  negative  effects  of  uneven  load  balanc¬ 
ing  between  processes  between  barrier  iterations.  However,  this  type  of  inefficiency  is 
due  to  the  programming  application,  and  is  thus  beyond  the  scope  of  this  paper.  What 
is of interest here is the additional overhead, if any, introduced by the barrier
algorithm. Specifically, the interaction of the barriers with the parallel programming
constructs that they synchronize will be examined. The analysis here will not attempt to
be exhaustive; it is instead an attempt to gain some insight into the run time behavior
of the different barrier algorithms.

Parallel  execution  within  a  given  programming  construct  will  be  modeled  using 
profiles.  Profiles  are  shown  as  two  dimensional  geometric  shapes.  A  profile  includes 
within its perimeter all the computation corresponding to the programming construct
that  it  represents.  On  an  x,y  grid  profiles  are  plotted  by  processes  against  time.  As 
in  the  previous  graphs,  the  processes  are  plotted  along  the  x  axis,  and  time  flows  down¬ 
ward  along  the  y  axis.  Computation  internal  to  a  profile  is  not  of  interest.  What  is 
shown  by  a  profile  is  the  time  that  each  process  enters  and  then  exits  a  given  con¬ 
struct.  We  limit  our  attention  to  parallel  programming  constructs  that  are  executed 
(on  each  iteration)  by  all  of  the  processes. 

The  power  of  the  model  lies  in  seeing  how  well  different  combinations  of  profiles 
fit together. This model will consider three categories of parallel programming
constructs: parallel work blocks, critical sections, and barriers. This programming model
supposes that an arbitrary but fixed number of processes execute a single program
consisting of these constructs. The goal is to minimize the execution time of a given
sequence  of  parallel  programming  constructs.  This  execution  time  is  modeled  by 
measuring  the  elapsed  distance  along  the  y-axis  occupied  by  the  corresponding 
sequence  of  profiles.  No  portion  of  a  given  profile  may  be  superimposed  on  any  part  of 
another  profile.  If  adjacent  profiles  do  not  fit  together  exactly,  then  the  resulting 
white  space  is  wasted  in  the  sense  that  processes  are  just  spinning,  although  this  white 
space  may  be  semantically  necessary.  A  key  point  here  is  that  other  constructs, 
including  the  barrier  itself,  are  free  to  exploit  this  white  space  without  reducing  the 
overall  performance.  Timing  runs  supporting  this  analysis  will  be  presented  in  the  fol¬ 
lowing  section. 

Parallel  work  blocks  are  assumed  to  be  non-blocking  constructs,  consisting  of 
some  scheduling  mechanism  which  parcels  out  chunks  of  single  stream  work  to  the 
various  processes.  These  chunks  of  work  may  then  be  executed  in  parallel.  The  area 
of  a  parallel  work  profile  corresponds  to  the  total  computation  and  scheduling  over¬ 
head  associated  with  that  parallel  work  block. 

Critical  sections  provide  for  mutual 
exclusion.  The  profile  for  a  critical  section 
will  contain  only  the  computation  that  a 
process  performs  while  it  is  actually  within 
the  critical  section.  The  time  spent  wait¬ 
ing  by  processes  that  are  temporarily 
blocked  by  a  critical  section  is  shown  as 
white  space  in  the  model.  This  kind  of 
waiting  is  caused  by  the  semantic  concept 
of  a  critical  section,  so  it  is  not  appropriate 
to include it as part of the cost of the
implementation  of  the  critical  section. 

The  barrier,  another  blocking  con¬ 
struct,  is  treated  in  a  similar  fashion. 
Blocking  that  is  semantically  inherent  to  a 
barrier  will  not  be  included  within  its 
profile.  Again,  this  kind  of  waiting  is 
shown  as  white  space.  Consider  an  ideal 
barrier  as  shown  in  Figure  3-1.  Since  there 
is  no  computational  overhead  associated 
with  an  ideal  barrier,  the  profile  for  this 
barrier  is  shown  as  a  horizontal  line,  with 
no  thickness.  It  will  block  the  processes 
that  encounter  it  until  all  have  arrived.  If, 
for  example,  several  processes  have  encountered  a  barrier  and  these  processes  are  wait¬ 
ing  for  some  stragglers,  then  this  waiting  is  semantically  inherent  to  the  barrier  and  is 
shown  as  white  space.  However,  additional  waiting  or  computation  required  only  by 
the  specific  implementation  of  a  barrier  algorithm  will  be  included  within  the  profile 
corresponding  to  that  barrier.  Thus,  the  profiles  for  non-ideal,  real  barriers  attempt  to 
show  the  overhead  costs  associated  with  the  barrier  implementation.  Note,  the  profile 
for  a  correct  real  barrier  must  include  within  its  perimeter  the  profile  of  the  ideal  bar¬ 
rier. 

The absolute depth of a profile refers to the elapsed distance along the y axis,
meaning  the  elapsed  time,  that  a  profile  would  require  if  it  were  sandwiched  between 
two  ideal  barriers.  Under  certain  circumstances,  adjacent  profiles  are  able  to  overlap 
all  or  part  of  their  execution.  Overlap  does  not  refer  to  physical  superposition  of  the 
profiles (which is not allowed); instead overlap refers to the situation where some
processes  are  still  executing  within  one  profile,  while  others  are  already  executing 
within  the  next  profile.  Thus,  the  execution  of  separate  constructs  may  be  partially 
overlapped  in  time.  If  overlap  occurs  between  two  or  more  successive  profiles  then  one 
can  see  that  the  resulting  effective  depth  of  a  sequence  of  profiles  will  be  less  than  the 
sum  of  the  absolute  depths. 

An interface between two adjacent profiles is defined as the exit contour of the
first profile taken together with the entry contour of the following profile. The degree
of overlap between two adjacent constructs depends upon this interface. Two general
situations can occur. Entry contours may be either pre-scheduled or self-scheduled
with respect to the process id. If processes must enter a profile in a specific order, then
the corresponding entry contour is pre-scheduled. Pre-scheduled entry contours may
or may not be able to overlap with uneven exit contours, depending on the order in
which processes are released from the exit contour. On the other hand, if processes
may enter a profile in any order, the entry contour is termed self-scheduled and it can
overlap as much as is possible with the preceding exit contour. If an interface is
pre-scheduled, we plot the processes along the x axis in the order of their process ids,
from one to np, where np is the number of processes. However, if an interface is
self-scheduled, the processes are plotted from fastest to slowest. The fastest process is
defined as the first process to enter (or exit) a given iteration of a parallel programming
construct. Likewise, the slowest process is defined as the last process to enter (or
exit) a parallel programming construct. In this fashion, all the processes may be
ranked from fastest to slowest. (Note, the designation of fastest or slowest may vary
dynamically among the processes.) The ability to exchange the two orderings, either
from one to np or from fastest to slowest, requires that the processes involved be fairly
homogeneous. These two orderings of the processes will prove useful when analyzing
the interfaces between successive profiles.

Three parallel work blocks interspersed with ideal barriers are shown in Figure
3-1. For the sake of simplicity, the optional sequential code blocks of the barriers are
ignored. The absolute depth of a parallel work block is given in (6), where W(id) is
the time that each process requires to do its share of the work. In the case of fixed
length work, the W(id)'s will all be the same, so the overall depth, $W_{np}$, is then
equivalent to W(id). The absolute depth of a critical section, $C_{np}$, is given in (7),
where C(id) is the time each process spends inside the critical section. If C(id) is a
constant, then $C_{np}$ simplifies to np*C(id). However, even if each process executes
the same code in a given critical section, C(id) may not be a strict constant. If the
time required for a process to signal that it is exiting a critical section is proportional
to the number of processes actively waiting for that signal, then C(id) has a
dependence on the number of processes seeking access to the critical section.


(6)    $W_{np} = \max_{1 \le id \le np} W(id)$

(7)    $C_{np} = \sum_{id=1}^{np} C(id)$
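
A minimal sketch of the two absolute depth formulas, assuming the per-process times are supplied as plain lists indexed by process; the function names `work_depth` and `critical_section_depth` are hypothetical and the sample times are arbitrary.

```python
def work_depth(W):
    """Absolute depth of a parallel work block: the slowest process sets it."""
    return max(W)


def critical_section_depth(C):
    """Absolute depth of a critical section: processes pass through in turn."""
    return sum(C)


# Fixed length work: every W(id) is the same, so the depth equals W(id).
fixed_work = [30, 30, 30, 30]
# Constant C(id): the depth simplifies to np * C(id).
constant_cs = [2, 2, 2, 2]
depths = (work_depth(fixed_work), critical_section_depth(constant_cs))
```

With these sample values the work block has depth 30 and the critical section has depth 8, which is np * C(id) for np = 4.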

The  eight  barrier  algorithms  developed  earlier  will  be  divided  into  the  following 
three  classes:  linear  self-scheduled,  linear  pre-scheduled,  and  tree  structured.  The 
profiles  model  will  be  employed  to  illustrate  some  differences  in  behavior  among  these 
classes  of  algorithms.  For  each  class  of  barrier,  three  cases  will  be  examined:  barriers 
interspersed  with  fixed  length  work,  barriers  interspersed  with  pre-scheduled  variable 
length  work,  and  barriers  interspersed  with  fixed  length  work  which  contains  a  critical 
section.  When  the  barrier  and  work  profiles  are  combined,  the  effective  depth  of  the 
barrier is defined to be the increase in depth over the absolute depth of the work
block. 

3.1.  Linear  self-scheduled  barriers 

The  single  lock  barrier  is  a  linear  barrier  algorithm  with  depth  proportional  to 
np.  The  profile  of  the  single  lock  barrier  has  a  self-scheduled  entry  contour,  meaning 
that  the  processes  may  enter  in  any  order,  but  no  faster  than  one  at  a  time.  Self¬ 
scheduling  is  implemented  through  the  use  of  critical  sections  internal  to  the  barrier. 
The  two  lock  barrier  is  similar  to  the  single  lock  barrier,  except  it  has  symmetric  struc¬ 
ture,  releasing  the  processes  one  at  a  time,  as  well.  Although  the  absolute  depth  of  the 
two  lock  barrier  is  twice  that  of  the  single  lock  barrier,  when  synchronizing  fixed 
length  work,  the  effective  depth  of  either  of  these  barriers  is  np.  This  is  apparent  from 
Figure  3-2. 


[Figure 3-2  Profiles: linear self-scheduled barriers. Panel (a): single lock barrier
(broadcast exit); panel (b): two lock barrier (symmetric structure). The profiles show
the barrier interspersed with fixed work, with variable work (best case), and with fixed
work containing a critical section. Time flows downward.]


If the parallel work is of variable length, then the analysis becomes more
complicated. Let the work variation be pre-scheduled (no load balancing employed).
Consider, for example, the case where one work assignment (or process suspension)
dominates the work distribution. For the single lock barrier, the effective depth of the
barrier when synchronizing a parallel work block dominated by a single long work
assignment will be close to zero. (See Figure 3-2.) As the work load becomes more
evenly balanced, the depth of the single lock barrier increases and approaches np. For
the two lock barrier, if the first process to be released from the barrier receives this
long unit of work, this process would be the last one to enter the next barrier iteration,
resulting in an effective barrier depth of close to zero. However, if the last process
released receives the dominating work assignment, then although this process would
still be the last one to enter the next barrier iteration, the effective depth of the barrier
is now np. Thus, for this example, the average effective depth of the two lock barrier
will be np/2, still linear but reduced by a factor of two. Other distributions of the
variable length work will show similar effects. The degree of the reduction of the
effective depth will depend on the exact distribution; however, on the average, the
effective depth of the two lock barrier synchronizing pre-scheduled, variable length
work is between np and np/2. So, in general, for the case of variable length work, the
single lock barrier shows substantial performance improvement over the two lock
barrier. The pre-scheduled variable length work block can be used to approximate the
slight variations in the exit contour of a self-scheduled parallel work block.
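
The behavior of the single lock barrier under different arrival patterns can be illustrated with a small model. This is a simplified sketch, not the paper's profiles model itself: it assumes that each pass through the barrier's entry critical section costs one time unit (`stage`), that the broadcast exit is free, and that processes are served in arrival order. The function name is hypothetical.

```python
def single_lock_effective_depth(arrivals, stage=1.0):
    """Extra time a self-scheduled linear entry adds beyond an ideal barrier.

    arrivals: time at which each process reaches the barrier.
    stage:    assumed cost of one pass through the entry critical section.
    """
    free_at = 0.0                        # time at which the entry lock is next free
    for t in sorted(arrivals):           # self-scheduled: served in arrival order
        free_at = max(free_at, t) + stage
    # An ideal barrier would release everyone at the last arrival time.
    return free_at - max(arrivals)


np_ = 8
fixed = single_lock_effective_depth([0.0] * np_)                     # all arrive together
dominated = single_lock_effective_depth([0.0] * (np_ - 1) + [100.0])  # one long straggler
```

Under these assumptions the fixed length case costs np stages, while the dominated case costs only the straggler's own single pass, because the other processes used the waiting time to clear the critical section.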

For  the  case  of  fixed  length  work  with  an  imbedded  critical  section,  we  see  that 
both  the  single  lock  and  two  lock  barriers  are  close  to  ideally  efficient!  Since  the  criti¬ 
cal  section  requires  processes  to  arrive  in  skewed  order  for  maximum  efficiency,  we  see 
that  it  does  not  hurt  if  the  barrier  implementation  lets  the  processes  out  in  a  skewed 
fashion.  And  since  the  processes  leave  the 
critical  section  in  a  skewed  manner  as  well, 
then  if  the  barrier  requires  a  skewed  entry, 
no  additional  performance  penalty  is 
incurred,  as  shown  in  Figure  3-2a. 


3.2.  Linear  pre-scheduled  barriers 

Instead  of  using  critical  sections,  the 
pre-scheduled  linear  algorithms  require  a 
single  master  to  accept  the  entry  signals 
from  all  other  processes,  one  at  a  time,  in  a 
predetermined  order.  The  pre-scheduled 
barriers  require  less  time  to  complete  each 
stage  of  their  algorithms,  since  they  do  not 
require  a  critical  section  at  each  stage.  A 
profiles  model  of  the  linear  pre-scheduled 
barrier  with  broadcast  exit  is  shown  in  Fig¬ 
ure  3-3. 

The linear barrier with broadcast exit has a depth of np + 1 for the case of fixed
length work. The symmetric pre-scheduled linear barrier occupies the master process
throughout both the entry and exit parts of the barrier. Thus, the effective depth of
this linear barrier remains 2*np for fixed length work, since the master must also
perform its share of the work. If the work distribution becomes variable, the
pre-scheduled linear barriers also show reduced depth, but not to the extent of their
self-scheduled counterparts. For example, the worst case scenario pictured in Figure 3-3
could not happen if the entry contour of the barrier were self-scheduled.

When  synchronizing  critical  sections,  the  effective  depth  of  the  pre-scheduled 
linear  barriers  is  not  as  good  as  the  effective  depth  of  their  self-scheduled  counterparts. 
If the processes go through the critical section in the optimal order, i.e., master first,
then slaves in the order of their process ids, then the effective depth of the barrier will
be close to optimal. However, if the master happens to be the last process to go
through  the  critical  section,  then  the  next  barrier  iteration  will  have  its  full  depth. 
Thus,  one  would  expect  some  reduction  of  the  effective  barrier  depth  when  synchroniz¬ 
ing  work  containing  critical  sections,  but  not  the  near  optimal  behavior  of  the  self- 
scheduled  linear  barriers. 

3.3.  Logarithmic  tree  barriers 

The  tree  barriers  have  depth  logarithmically  proportional  to  np.  The  profiles  for 
the  tree  barriers,  as  shown  in  Figure  3-4,  are  quite  simple,  since  they  are  rectangular  in 
shape, with flat entry and exit contours. The tree barriers are also analyzed for each
of the three conditions above (fixed length work, variable length work, and work
containing a critical section); however, the analysis is much simpler. Unlike the linear
barriers,  the  effective  depth  of  the  tree  barriers  is  nearly  independent  of  the  type  of 
parallel  work  that  they  synchronize.  No  matter  in  what  order  processes  arrive,  each 
process,  including  the  last,  must  go  through  all  the  stages  of  the  tree.  If  the  process 
arrival  times  are  skewed,  some  variance  in  the  effective  depth  results  since  the  time  a 
process  spends  at  each  stage  varies  slightly  depending  on  whether  it  is  playing  a  mas¬ 
ter  or  slave  role  at  that  node.  But  this  is  only  a  minor  effect.  Thus,  even  if  there  is  a 
wide  variance  in  process  arrival  times,  the  effective  depth  of  the  tree  barrier  remains 
nearly  constant. 

The analysis for tree barriers with broadcast exit is similar. The only difference is
that the broadcast exit reduces the depth from $2 \log_2(np)$ to $\log_2(np) + 1$. If locks are
used, then each stage of the tree would be expected to have a longer execution time
compared to trees that use only boolean variables, resulting in a longer total execution
time. Otherwise, the analysis is unchanged.
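
To compare the depth expressions quoted in this section side by side, a small helper can tabulate them for a given np. This is a sketch under stated assumptions: depths are counted in stages, the tree is taken to be binary (base 2 logarithm), and the logarithm is rounded up when np is not a power of two. The function name and dictionary keys are hypothetical.

```python
import math

def barrier_depths(np):
    """Stage counts for fixed length work, using the formulas quoted in the text."""
    stages = math.ceil(math.log2(np))    # assumed binary tree; rounded up
    return {
        "linear_broadcast": np + 1,      # pre-scheduled linear, broadcast exit
        "linear_symmetric": 2 * np,      # pre-scheduled linear, symmetric
        "tree_symmetric": 2 * stages,    # tree, symmetric entry and exit
        "tree_broadcast": stages + 1,    # tree, broadcast exit
    }


table = {n: barrier_depths(n) for n in (2, 8, 16, 18)}
```

For np = 16 this gives 17 and 32 stages for the linear barriers against 8 and 5 for the tree barriers; it says nothing about how long an individual stage takes.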


4.  Timing  results 

Timing runs on an actual shared memory multiprocessor support the predictions
derived from the profiles parallel execution model developed above. Several
experiments timing all eight barrier algorithms were run on a Flexible Computer
Corporation Flex/32. In order to evaluate the effect of the number of processes (np) on
barrier performance, np was varied from two to eighteen in increments of two. Barriers
synchronizing fixed length work, variable length work, and fixed length work with an
imbedded critical section were timed in three separate experiments.

4.1.  Methodology 

Each  experiment  consisted  of  nine  trials;  one  trial  for  each  of  the  eight  barrier 
algorithms,  and  one  trial  simulating  the  behavior  of  the  ideal  barrier.  For  each  of  the 
eight  "regular"  trials,  a  barrier  followed  by  the  parallel  work  block  corresponding  to 
that  experiment  was  placed  in  a  loop,  and  execution  of  this  loop  was  timed  for  100000 
iterations.  The  timing  of  the  ideal  barrier  when  synchronizing  the  various  work  blocks 
was  simulated  using  some  algorithms  described  below.  The  elapsed  time  of  the  ideal 
barrier  trial  was  then  subtracted  from  the  elapsed  times  for  each  of  the  other  eight  tri¬ 
als.  These  resulting  times  were  then  divided  through  by  the  number  of  iterations  of 
the  loop,  yielding  a  measure  of  the  per  iteration  overhead  (effective  depth)  imposed  by 
each  of  the  eight  barrier  algorithms. 
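
The subtraction and division described above amount to the following calculation. The function name and the elapsed times in the example are hypothetical, chosen only to show the arithmetic; they are not measurements from the paper.

```python
def effective_depth_ms(barrier_elapsed_s, ideal_elapsed_s, iterations=100_000):
    """Per iteration barrier overhead in milliseconds.

    barrier_elapsed_s: elapsed seconds for the loop with a real barrier.
    ideal_elapsed_s:   elapsed seconds for the simulated ideal barrier trial.
    """
    return (barrier_elapsed_s - ideal_elapsed_s) * 1000.0 / iterations


# Illustrative numbers only: a 250 s trial against a 200 s ideal trial.
overhead = effective_depth_ms(250.0, 200.0)
```

With these made-up inputs the overhead is 0.5 ms per barrier iteration.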

The  first  experiment  consisted  of  timing  barriers  that  synchronized  a  fixed  length 
parallel  work  block.  The  fixed  length  work  required  each  process  to  execute  30  itera¬ 
tions  of  single  precision  multiply  additions  (mul-adds)  and  some  associated  subroutine 
linkage.  This  number  of  mul-adds  is  sufficient  to  insure  that  successive  barriers  will 
not  attempt  to  overlap  with  each  other.  Strictly  private  operands  were  used.  A  bar¬ 
rier  followed  by  the  fixed  length  work  block  was  placed  in  a  loop  and  timed  for  100000 
iterations. This measurement was repeated for each of the eight barrier algorithms.
The  ideal  barrier  timing  loop  was  simulated  using  a  very  simple  algorithm:  time  only 
the  fixed  length  work  for  100000  iterations. 

The second experiment timed barriers synchronizing variable length,
pre-scheduled, parallel work blocks. In a set up phase, processes iteratively filled private
arrays, called myrand, with 100000 random numbers. Processes also cooperated to
determine the maximum random number that was generated on each iteration, and
these maximum values were stored in separate private arrays, called maxval. The
random numbers ranged between 30 and 59, with a flat distribution among these
values. A linear congruential random number generator was used, and each process
calculated an initial private seed by adding its process id to a single shared "starter"
integer. Each iteration of the variable length work block required the processes to pull
a random number from their myrand arrays, and then they would execute that many
iterations of single precision mul-adds. For the timed part of this experiment, a
barrier followed by the variable length work block was placed in a loop and timed for
100000 iterations. The ideal barrier timing loop was simulated by timing a single
process executing 100000 iterations of "work" with no barriers; each "work" iteration
consisted of retrieving the maximum random number from the maxval array and then
executing that many mul-adds. In this manner an ideal barrier is simulated, since
an ideal barrier would have to wait, on each iteration, for the process with the
most work to finish.
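
The set up phase can be sketched as follows. The paper does not give the generator's constants or the value of the shared starter integer, so the multiplier, increment, modulus, and default `starter` below are assumptions chosen for illustration, as is the function name. Only the array names myrand and maxval, the 30 to 59 range, and the seed rule (starter plus process id) come from the text.

```python
def make_work_lengths(np, iterations, starter=12345, lo=30, hi=59):
    """Build per-process work lengths and the per-iteration maximum.

    Returns (myrand, maxval): myrand[p][k] is the work length of process p + 1
    on iteration k; maxval[k] is the largest work length on iteration k.
    """
    a, c, m = 1664525, 1013904223, 2 ** 32   # assumed LCG constants
    myrand = []
    for pid in range(1, np + 1):
        seed = starter + pid                 # private seed: starter plus process id
        row = []
        for _ in range(iterations):
            seed = (a * seed + c) % m
            # Use the high-order bits, which are better distributed in an LCG.
            row.append(lo + (seed >> 16) % (hi - lo + 1))
        myrand.append(row)
    maxval = [max(column) for column in zip(*myrand)]
    return myrand, maxval


myrand, maxval = make_work_lengths(np=4, iterations=1000)
ideal_muladds = sum(maxval)   # mul-adds executed by the simulated ideal barrier loop
```

Summing maxval gives the amount of work the single timing process performs in the ideal barrier trial.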

Finally, the third experiment timed barriers synchronizing fixed length work
with an imbedded critical section. Each of these work blocks consisted of 15
mul-adds, followed by a critical section, which enclosed a single mul-add, followed by 15
more mul-adds. Since each of the np processes must execute the critical section
in turn along with its private mul-adds, the ideal barrier timing loop is simulated
by timing 30 + np mul-adds and np subroutine linkages (in order to simulate the
effects of the spinlocks) on each iteration. This experiment suffers from the difficulty
of approximating the ideal barrier exactly, since the bus contention produced by the
critical section cannot be exactly accounted for, and this overhead could
thus be incorrectly attributed to the barriers.

These three experiments are interesting because they approximate some typical
parallel work scheduling mechanisms: pre-scheduling and self-scheduling [4].
Pre-scheduling does not have the synchronization overhead required by self-scheduling
and is efficient when work iterations are constant in their execution time.
Pre-scheduling is often employed to schedule (non-branching) parallel loops. With
pre-scheduling, work iterations are divided up evenly among processes, irrespective of the
execution time required by each iteration. As is shown in Figure 4-1, if processes enter
a homogeneous pre-scheduled work block in unison, they probably will exit in a step
function, since np likely will not divide evenly into the number of work iterations.
The first experiment timing fixed length work approximates a pre-scheduled parallel
loop where np does divide evenly into the number of loop iterations. However, even if
processes exit in a step function, we still have the situation where many processes exit
the work block at once.

[Figure 3-4  Profiles: tree structured barrier]
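
The step function exit contour follows directly from how pre-scheduling divides the iterations. A minimal sketch, assuming the leftover iterations go to the lowest numbered processes (the paper does not say which processes receive them); the function name is hypothetical.

```python
def prescheduled_counts(n_iterations, np):
    """Number of work iterations assigned to each process under pre-scheduling."""
    base, extra = divmod(n_iterations, np)
    # Assumption: the first `extra` processes each take one additional iteration.
    return [base + 1 if pid < extra else base for pid in range(np)]


counts = prescheduled_counts(15, 4)
```

For 15 units of work and np = 4, three processes receive 4 iterations and one receives 3, so with equal iteration times one process exits a step earlier than the rest.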

Self-scheduling  provides  for  load  balancing  and  is  efficient  when  work  iterations 
vary  in  their  execution  times.  With  self-scheduling,  under  protection  of  a  critical  sec¬ 
tion,  processes  take  the  "next"  available  work  descriptor  from  a  shared  scheduling 
mechanism  whenever  they  are  ready  for  additional  work.  In  spite  of  the  load  balanc¬ 
ing  concept,  processes  will  be  somewhat  skewed  in  time  as  they  exit  a  self-scheduled 
parallel  work  block.  This  skew  results  from  variance  in  work  execution  times  and/or 
the  effect  of  the  critical  section  used  to  schedule  the  work  iterations.  The  second  and 
third  experiments  approximate  barriers  synchronizing  self-scheduled  work  since  they 
model  variable  length  work  and  the  effects  of  critical  sections,  respectively. 
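
Self-scheduling as described here can be sketched with a shared counter protected by a lock. The function name, the use of a plain integer as the "work descriptor", and the list used to record what each process took are all assumptions for illustration; real work would be executed where the item is recorded.

```python
import threading

def self_scheduled_run(n_items, np):
    """Distribute n_items work descriptors among np threads on demand.

    Returns a list whose entry p holds the items taken by thread p.
    """
    lock = threading.Lock()
    next_item = [0]                       # shared scheduling state
    taken = [[] for _ in range(np)]

    def worker(pid):
        while True:
            with lock:                    # critical section guarding the scheduler
                item = next_item[0]
                if item >= n_items:
                    return                # no work left
                next_item[0] = item + 1
            taken[pid].append(item)       # the work itself runs outside the lock

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(np)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return taken


assignment = self_scheduled_run(n_items=100, np=4)
```

Every item is taken exactly once, but how many each thread receives depends on timing, which is the source of the skew mentioned above.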


4.2.  Computing  environment 


These three experiments were run on a Flex/32 shared memory multiprocessor
consisting of a shared memory store and a set of single board microcomputers with
true private memory on each board. Processes may be bound to processors, so
unexpected process suspension is not a major issue. The Flex/32 supports efficient
implementation of spinlocks through a hardware test and set machine language
instruction that is available to the user. Memory accesses into private data structures
and instruction fetches do not interfere with shared memory cycles on the dual
common bus. A proprietary architecture interfaces local busses with a dual common
bus connected to shared memory. Unfortunately, Flexible Computer Corporation
has not published detailed descriptions of these interfaces. Flexible provides MMOS,
its distributed "multicomputing" operating system [10], to supervise parallel programs.
The Flex/32 used belonged to the NASA Langley Computational Structural
Mechanics Group in Hampton, Virginia. NASA's Flex/32 is configured with eighteen
processors able to run in parallel.

[Figure 4-1  Profiles: a pre-scheduled work block. Example: let np = 4. If there are 15
units of work to be scheduled, with pre-scheduling each process receives either 3 or 4
work iterations.]

4.3. Sources of timing error

Before describing the curves, let us examine some sources of timing errors.
The clock function on the Flex/32 is implemented through software using a system
interrupt. These system interrupts increment a private clock (an integer variable).
The timer granularity was one second; since 100000 iterations were timed, the timing
granularity per iteration is reduced to .01 millisecond (ms). Since the effective depth
of the barriers per iteration was calculated as the difference between two elapsed
times, the error due to timing granularity (per iteration) is within ± .02 ms. Over
twenty four hours of parallel cpu time was required in order to achieve this low timing
granularity.


There is an additional source of error. Each processor receives the system
interrupt at the specified frequency. These interrupts occur asynchronously for the
processes in a round robin fashion. Each interrupt has been measured to last
approximately .3 ms. In order to minimize the effects of the system interrupts, the
timing program was compiled with a configuration specifying a frequency of only one
interrupt per second, hence a one second timing granularity. (A lower frequency of
interrupts is not possible under MMOS.) A .3 ms interrupt per second represents only
.03% of cpu time per processor. However, since the interrupts occur asynchronously,
when all 18 processors are being used, then during .54% of the time one of the
processors will be servicing the interrupt. This is the parameter of interest when
timing barriers! Fortunately, this magnitude of process suspension will not
significantly distort timing measurements. It should be noted that NASA's Flex/32 has
a default configuration of 50 MMOS system interrupts per second. While this
configuration reduces the timer granularity to 20 ms, the percentage of time during
which one of the processors is suspended is increased to a whopping 27%, clearly
unacceptable for timing barriers.
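
The percentages above follow from simple arithmetic, reproduced here as a sketch. The function name is hypothetical. Like the text, the calculation simply multiplies the per-processor fraction by the number of processors, which assumes that interrupts on different processors never coincide.

```python
def interrupt_overhead(interrupts_per_s, n_procs=18, interrupt_ms=0.3):
    """Fraction of time lost to system interrupts.

    Returns (per_proc, any_proc): the fraction of each processor's time spent
    in interrupts, and the fraction of time during which some processor is
    servicing one, assuming the interrupts do not overlap.
    """
    per_proc = interrupts_per_s * interrupt_ms / 1000.0
    any_proc = n_procs * per_proc
    return per_proc, any_proc


timing_config = interrupt_overhead(1)     # one interrupt per second
default_config = interrupt_overhead(50)   # 50 interrupts per second
```

One interrupt per second gives about .03% per processor and .54% across 18 processors; 50 per second gives 1.5% per processor and 27% across 18 processors.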

In  order  to  insure  that  the  compiler  expands  the  timing  loops  identically  for  each 
of  the  barrier  algorithms  being  timed,  the  barriers  were  executed  via  subroutine  calls 
from within the timing loops. In this manner all the timing loops are guaranteed to
have  the  identical  machine  code,  thereby  eliminating  a  subtle  source  of  timing  bias 
that  could  be  present  if  the  barriers  were  expanded  in-line  within  the  timing  loops. 

4.4.  Results 

The results for each of the three experiments are plotted in Figures 4-2, 4-3, and
4-4. Each figure shows curves corresponding to the various barriers, with different
values  of  np  plotted  against  effective  execution  time  (milliseconds  per  barrier  itera¬ 
tion).  The  effective  execution  time  of  a  barrier  is  defined  as  the  difference  between  the 
execution  time  of  the  work  and  barrier  combination  and  the  execution  time  of  the 
work  synchronized  by  an  ideal  barrier.  All  three  figures  are  plotted  using  the  same 
time  scale,  allowing  for  comparison  between  figures. 

For the case of barriers synchronizing fixed length work, Figure 4-2 plots the
observed effective depth of all eight barriers against np. One observation is that the
logarithmic tree barriers show better performance than the linear barriers, even for
small values of np. Also, the broadcast barriers (those using the polarity exit
mechanism) show better performance than their symmetric (identical entry and signal
structure) counterparts. Another observation is that the barriers that use spinlock
routines show marked performance degradation as np becomes large. This effect may be
attributed to increased bus competition that forces shared memory bus requests to line
up in a queue. If many processes are competing for access to a lock, one might think
that no performance degradation would result, since one of the processes should be
succeeding, even if others are having their bus requests delayed. This is indeed the
case; however, inefficiency is introduced when the owner of a lock must compete for
shared bus access in order to unlock it. If 18 processes are competing randomly, the
average unlock command requires around 17 attempts before succeeding, assuming a
single shared bus with random arbitration. The situation with Flexible's dual bus is
less clear, but this same type of effect is probably occurring. Since the critical
sections themselves are very short, increasing the number of bus cycles required by the
unlock instruction will significantly degrade performance, and this is evident from the
plot. Apparently, it is the read-write cycles that place the greatest burden on the
shared memory bus. The bus contention appears to be much less for the barriers that do
not use spinlocks.
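
The figure of about 17 attempts follows from a simple model of random arbitration. The sketch below assumes that each bus cycle is granted to one of the competing requesters uniformly at random, independently of earlier cycles, so the number of lost rounds before the lock owner wins follows a geometric distribution. The function name is illustrative and the model ignores the Flex/32's dual bus.

```python
from fractions import Fraction


def expected_failed_attempts(competitors):
    """Mean number of lost arbitration rounds before a given requester wins.

    Assumes one requester out of `competitors` is chosen uniformly at random
    in every round, independently of earlier rounds.
    """
    p = Fraction(1, competitors)   # chance of winning any single round
    return (1 - p) / p             # mean failures before the first success


print(expected_failed_attempts(18))   # 17
```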

Figure 4-3 shows the barrier performance when synchronizing variable length
parallel work blocks. As the profiles model predicts, the linear barriers are better able
to exploit the variation in process arrival times. One interesting feature is that the
linear barriers with broadcast exit do a better job than those that have linear exit
phases. To prevent visual clutter, Figures 4-3 and 4-4 do not plot curves for
the tree barriers that use locks. However, timing measurements were made for these
algorithms as well, and the tree barriers with locks showed substantially greater
overhead than the tree barriers without locks.

[Figure 4-3  Timing results for barriers synchronizing variable-length work.
Vertical axis: effective depth (milliseconds).]

[Figure 4-4  Timing results for barriers synchronizing fixed work containing
a critical section. Vertical axis: effective depth (milliseconds).]

Legend for Figures 4-3 and 4-4:

barlin:  linear (pre-sched),   no locks,  symmetric entry & signal phases
barlok:  linear (self-sched),  locks,     symmetric
bartre:  tree structured,      no locks,  symmetric
bbrlin:  linear (pre-sched),   no locks,  broadcast exit signal
bbrlok:  linear (self-sched),  locks,     broadcast
bbrtre:  tree structured,      no locks,  broadcast

Figure  4-4  plots  the  barrier  performance  when  synchronizing  work  blocks  with 
critical  sections.  For  the  case  of  critical  sections,  the  barriers  that  had  the  worst  per¬ 
formance  in  the  fixed  length  work  experiment  now  show  the  best  performance!  Even 
for  the  larger  values  of  np,  the  linear  barriers  have  a  small  depth  that  remains  nearly 
independent  of  np,  whereas  the  tree  structured  barriers  show  their  normal  logarithmic 
growth.  Although  Figure  4-4  may  look  cluttered,  it  would  be  misleading  to  provide 
more  detail  by  enlarging  the  time  scale  since  the  timing  granularity  is  within  ±  .02 
ms. 


5.  Conclusions

Three  concepts  were  isolated  in  the  development  of  these  barrier  algorithms.  A 
barrier  may  have  linear  or  tree  structured  communication  patterns.  A  barrier  may 
have  symmetric  entry  and  signal  phases,  or  the  signal  phase  may  use  a  single  broadcast 
exit  signal.  And  synchronization  within  a  barrier  may  rely  solely  upon  memory 
accesses  into  shared  data  structures,  or  algorithms  may  use  locks  and  their  associated 
spinlock  routines.  For  the  sake  of  completeness,  and  to  provide  for  a  thorough  founda¬ 
tion  upon  which  to  make  comparisons,  all  eight  combinations  of  these  three  concepts 
were  realized  as  barrier  algorithms.  Specifying  which  of  these  barriers  is  the  "best"  is 
not  so  easy  a  task,  since  there  are  several  trade  offs  involved  and  different  machine 
architectures  may  favor  different  barrier  implementations. 

5.1.  Analysis 

If  we  have  parallel  routines  that  can  be  executed  by  an  arbitrary  number  of 
processes,  then  the  speedup  of  a  parallel  routine  can  be  defined  to  be  the  ratio  of  the 
single  process  execution  time  of  the  routine  (without  synchronization  overheads) 
against  the  parallel  execution  time  (including  synchronization  overheads).  Efficiency  is 
defined as the ratio of the speedup to np. A parallel work load that is not evenly
balanced among processes between barrier iterations is a primary cause of loss of
efficiency.  The  barrier  algorithm  itself  may  also  contribute  to  the  inefficiency  of  a 
parallel  program. 

Introducing  optimized  barriers  into  existing  programs  tends  to  result  in  only 
minor  improvement  in  the  speedup  if  these  programs  were  not  "barrier  bound"  to 
begin  with.  Optimizing  the  barrier’s  execution  time  delivers  instead  a  different  payoff: 
the  threshold  size  of  work  blocks  that  may  profitably  be  parallelized  is  decreased. 
Equation (8) shows the formula for speedup when considering only a single parallel
work block followed by a barrier, where Wnp is the time required for np processes to
execute the parallel work, W1 is the single process execution time, and Bnp is the
effective barrier execution time. If Wnp >> Bnp, then decreasing Bnp will not
improve the speedup by much. However, if Bnp is of similar or greater magnitude than
Wnp, then decreasing Bnp will substantially increase the speedup. Thus reducing Bnp,
the barrier effective execution time, improves the speedup when small chunks of work
(followed by a barrier) are parallelized, and also allows programmers to profitably
employ barriers to synchronize yet smaller parallel work blocks.

$$\text{speedup} = \frac{W_1}{W_{np} + B_{np}} \qquad (8)$$

When synchronizing 18 processes, the effective execution times of these barriers
on the Flex/32 ranged from .12 ms to 1.32 ms across all the experiments, with ranges
of .20 ms to 1.32 ms for fixed length work, .15 ms to 1.07 ms for variable length work,
and .12 ms to .38 ms for fixed length work with an imbedded critical section. The
variance in these times is mostly due to the linear algorithms, whose performance is
quite dependent on the type of work that they synchronize. The tree barriers had
much more stable execution times across the experiments. For reference, when
synchronizing 18 processes, bartre ranged from .38 to .44 ms per iteration across the
three experiments, and bbrtre ranged from .26 to .29 ms.
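
To see how the measured barrier times affect the threshold size of a work block, the speedup of a single work block followed by a barrier can be evaluated directly. The sketch below takes speedup as W1 / (Wnp + Bnp), following the definitions given above, and additionally assumes perfectly balanced work, Wnp = W1 / np. That balance assumption and the work sizes of 18 ms and 1800 ms are illustrative and are not measurements from the paper. The barrier times .12 ms and 1.32 ms are the extremes quoted above.

```python
def speedup(w1_ms, wnp_ms, bnp_ms):
    """Single-block speedup: sequential time over parallel time plus barrier time."""
    return w1_ms / (wnp_ms + bnp_ms)


def balanced_speedup(w1_ms, nprocs, bnp_ms):
    """Speedup under the assumption of perfectly balanced work, Wnp = W1 / np."""
    return speedup(w1_ms, w1_ms / nprocs, bnp_ms)


for w1 in (18.0, 1800.0):                 # hypothetical sequential work sizes, in ms
    slow = balanced_speedup(w1, 18, 1.32)  # slowest barrier time quoted above
    fast = balanced_speedup(w1, 18, 0.12)  # fastest barrier time quoted above
    print(w1, round(slow, 2), round(fast, 2))
```

With the small block the choice of barrier roughly doubles the speedup, while with the large block the two results differ by about one percent.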

In order to place the barrier execution times into perspective, let us compare their
effective depth with the execution speed of the following single precision Fortran
vector calculation: C(i) = C(i) + A(i)*B(i). The National Semiconductor NS32032s used
in the Flex/32 (Greenhills compiler) require about .038 ms to compute this sum and
product for each iteration of the vector index i. This measurement includes Fortran
DO loop overhead and index calculations as well as the floating point multiplication
and addition. Thus, the range of barrier times, .12 to 1.32 ms, maps into a range of 3
to 35 of these sequential vector element multiply-additions. As an example, the
effective depth of bbrtre synchronizing 18 processes, .26 ms, is roughly equivalent to 7
of these vector element multiply-additions.
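
The conversion from barrier time to multiply-additions is a single division. The sketch below uses only the figures quoted above; the function name is illustrative.

```python
MULTIPLY_ADD_MS = 0.038   # time per vector element multiply-addition, from the text


def multiply_add_equivalents(barrier_ms):
    """Number of sequential multiply-additions that fit into one barrier time."""
    return barrier_ms / MULTIPLY_ADD_MS


for barrier_ms in (0.12, 0.26, 1.32):
    print(barrier_ms, round(multiply_add_equivalents(barrier_ms)))
```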

The  primary  advantage  of  the  tree  barriers  is  their  logarithmic  depth.  As  the 
number  of  processes,  np,  becomes  large,  this  advantage  becomes  overwhelming,  as 
demonstrated  in  Figure  4-2,  the  timing  results  for  barriers  synchronizing  fixed  length 
work.  Although  the  per  stage  execution  times  of  the  tree  barriers  are  higher  than 
those  of  the  linear  barriers,  considering  Figure  4-2,  we  see  that  the  tree  barriers  would 
all  be  expected  to  overtake  their  linear  counterparts  as  np  becomes  large  enough.  For 
example,  using  the  results  obtained  on  the  Flex/32,  bbrtre  would  be  expected  to  over¬ 
take  even  bbrlin,  the  most  efficient  linear  barrier  synchronizing  fixed  length  work,  for 
values  of  np  near  20.  Yet,  the  linear  barriers  are  able  to  improve  their  performance  as 
process  arrival  times  become  increasingly  staggered,  while  the  depth  of  a  tree  barrier  is 
nearly  invariant  with  respect  to  process  arrival  behavior.  For  example,  if  there  are 
critical  sections  between  successive  barrier  iterations,  then  the  self-scheduled  linear 
barriers  (barlok,  bbrlok)  are  almost  ideally  efficient,  while  the  depth  of  a  tree  barrier 
remains proportional to log2(np).
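
The logarithmic depth can be read off the order in which each process waits for its partners in the entry phase. The sketch below is a Python transcription of the loop structure of the bbrtre entry routine in Appendix A, written for illustration only. The function names are not from the paper, and the sketch computes the partner list without performing any synchronization.

```python
def initial_lim(nprocs):
    """Largest power of two strictly below nprocs (0 when nprocs is 1)."""
    lim = 1
    while lim < nprocs:
        lim *= 2
    return lim // 2


def entry_partners(pid, nprocs):
    """Processes that process `pid` (numbered from 1) waits for, in order."""
    partners = []
    ilim = initial_lim(nprocs)
    while pid <= ilim:
        partner = pid + ilim
        if partner <= nprocs:      # skip partners that do not exist
            partners.append(partner)
        ilim //= 2
    return partners


for pid in (1, 2, 3):
    print(pid, entry_partners(pid, 18))
```

For np = 18 the root process waits for five partners, one per level of the tree, and every other process is waited for by exactly one parent, so the entry phase involves np - 1 communications in total.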

The tree barriers presented in this paper have a logarithmic depth similar to that
of the proposed butterfly barrier [7]. However, the data requirements and
synchronizational complexity of the trees are substantially lower, both O(np), rather
than O(np*log2(np)) as is the case with the butterfly barrier. Synchronizational
complexity is defined as the number of communications between processes. Consider the
tree barrier, removing the exit phase, in relation to the butterfly barrier. If one takes
np trees, letting each process be the root of one of these trees, and then one
superimposes all np of these trees removing redundancies, then the butterfly topology
results. Thus symmetry is achieved at a cost of superimposing np master-slave tree
topologies. The butterfly barrier distributes the function of master to every process,
requiring each process to determine independently that all others have arrived. One
could argue that, depending on machine architecture, the butterfly barrier would
probably have inferior performance compared to the tree, due to the sheer magnitude
(O(np*log2(np))) of shared memory accesses. While the inherent symmetry of the
butterfly barrier is aesthetically appealing, this is quite a price to be paid for that
symmetry.
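
The difference between the two growth rates can be made concrete with a small count. The sketch below assumes np is a power of two, counts one communication per waited-for partner in the tree entry phase, and counts one communication per process per stage for the butterfly. These counting conventions and the function names are illustrative assumptions, not taken from the paper or from [7].

```python
def tree_entry_communications(nprocs):
    """Every process except the root is waited for exactly once."""
    return nprocs - 1


def butterfly_communications(nprocs):
    """Each process communicates once in each of log2(nprocs) stages."""
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("this sketch assumes nprocs is a power of two")
    stages = nprocs.bit_length() - 1      # exact log2 for a power of two
    return nprocs * stages


for n in (4, 16, 64):
    print(n, tree_entry_communications(n), butterfly_communications(n))
```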

On the Flex/32, the tree barriers that use only controlled access to synchronizing
variables are more efficient than those that use spinlocks. One observation is that the
tree barriers using locks tended to show nearly linear depth on the Flex/32, due to the
bus contention problem caused by the layers of spinlocks. In fact, the indivisible
read-write bus cycles are unnecessary for the tree barriers, and they tie up the shared
memory bus(es) longer than necessary. In general, spinlocks tend to require subroutine
linkage or possibly inefficient operating system calls, and the spinlocks involve more
computational steps than the set/clear mechanisms. However, one should keep
in mind that spinlocks may provide superior performance on machines with special
hardware supporting locks.

The issue of whether to use spinlocks or not is a different matter altogether for
the linear barriers. The linear spinlock barriers allow the processes to arrive in any
order, self-scheduling the entry into the critical section, and they very effectively
exploit any variation in their arrival times. The pre-scheduled linear barriers (using
set/clear mechanisms) require a fixed order of arrival of the processes in order to
achieve their best performance. These barriers are also able to exploit variations in
process arrival times, but to a lesser degree than their self-scheduled counterparts.
Thus, the self-scheduled barriers that use spinlock routines are appealing. However,
consider the following trade off. On one hand, as demonstrated in Figures 4-3 and 4-4,
the self-scheduled algorithms have better performance when process arrival patterns
are significantly skewed. However, due to the critical sections, each stage of the
self-scheduled barriers requires more execution time than the corresponding stage of the
pre-scheduled barriers. This situation occurs even on the Flex/32, which supports a
machine language test and set instruction used to implement the critical sections. So
on the other hand, when processes arrive all at once and the effective depth of a linear
barrier is the sum of its stages, then the linear pre-scheduled barriers show better
performance than the self-scheduled barriers, as shown in Figure 4-2.
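
This trade off can be illustrated with a toy timing model. The sketch below treats the entry phase of a linear barrier as a single server that handles one process per stage: the self-scheduled barrier serves processes in the order they arrive, while the pre-scheduled barrier polls them in fixed index order. The model, the function names, and the stage costs of 30 and 20 microseconds are illustrative assumptions and are not measurements from the paper. Depth is measured from the last arrival, and all times are non-negative integers in microseconds.

```python
def self_scheduled_depth(arrivals, stage):
    """Processes are served in arrival order; each entry stage costs `stage`."""
    t = 0
    for a in sorted(arrivals):
        t = max(t, a) + stage
    return t - max(arrivals)


def pre_scheduled_depth(arrivals, stage):
    """The master polls processes in index order; each poll costs `stage`."""
    t = 0
    for a in arrivals:
        t = max(t, a) + stage
    return t - max(arrivals)


together = [0] * 17                          # 17 workers arriving all at once
in_order = [50 * i for i in range(17)]       # staggered, lowest index first
reversed_order = in_order[::-1]              # staggered, highest index first

for name, arrivals in (("together", together), ("in order", in_order),
                       ("reversed", reversed_order)):
    print(name, self_scheduled_depth(arrivals, 30), pre_scheduled_depth(arrivals, 20))
```

Under these assumptions the pre-scheduled barrier wins when all processes arrive together, both barriers collapse to a single stage when arrivals are staggered in the favorable order, and only the self-scheduled barrier keeps that advantage when the arrival order is reversed.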

All of the barrier algorithms developed in this paper have analogous versions
using either symmetric entry and signal phases, or the broadcast exit/polarity idea
developed above. On the Flex/32, the broadcast versions show better performance
than their symmetric counterparts. The broadcast exit reduces the exit depth from np
or log2(np) to one, while requiring only minimal computation. If the underlying
machine hardware supports true parallel reads of shared data, then the broadcast exit
mechanism is almost ideally efficient. If the machine hardware does not support true
parallel reads, then the situation where many processes compete to read the exit
variable is like a linear critical section, but with a very short time quantum, a single
shared memory bus cycle. Given this situation, a very large number of processes, and
a machine with multiple, hashed, shared memory modules, where memory references to
distinct modules may proceed in parallel, then the symmetric tree barriers (bartre,
bartrl) could conceivably yield better performance since they eliminate the
competition to read the single exit variable. Care would have to be taken to ensure that
the synchronizing variables are kept in different memory modules.
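
The broadcast exit with polarity can be sketched with ordinary threads. The code below is a loose Python rendering of the structure of bbrlok from Appendix A (a lock-protected counter for entry and a single exit variable whose value alternates between iterations). It is an illustration under several assumptions: the class and function names are invented here, Python threads stand in for processes, and the waiting loops yield with time.sleep(0) rather than spinning on a bus.

```python
import threading
import time


class BroadcastBarrier:
    """Shared state: arrival counter, exit variable, and the entry lock."""

    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.counter = nprocs - 1            # workers that have not yet arrived
        self.exit_flag = False               # the single shared exit variable
        self.entry_lock = threading.Lock()   # protects the counter update


def master_entry(barrier):
    while barrier.counter != 0:              # wait until every worker has arrived
        time.sleep(0)


def master_signal(barrier, polarity):
    barrier.counter = barrier.nprocs - 1     # re-arm for the next iteration
    barrier.exit_flag = polarity             # one write releases all workers
    return not polarity


def worker_entry(barrier, polarity):
    with barrier.entry_lock:                 # self-scheduled critical section
        barrier.counter -= 1
    polarity = not polarity
    while barrier.exit_flag == polarity:     # wait for the exit variable to flip
        time.sleep(0)
    return polarity


def run(nprocs=4, rounds=20):
    """Run `rounds` barrier iterations and return one check per iteration."""
    barrier = BroadcastBarrier(nprocs)
    progress = [0] * nprocs                  # last iteration reached by each thread
    checks = []

    def master():
        polarity = True
        for k in range(1, rounds + 1):
            progress[0] = k
            master_entry(barrier)
            # Sequential section: every worker must have reached iteration k.
            checks.append(all(p == k for p in progress))
            polarity = master_signal(barrier, polarity)

    def worker(i):
        polarity = True
        for k in range(1, rounds + 1):
            progress[i] = k
            polarity = worker_entry(barrier, polarity)

    threads = [threading.Thread(target=master, daemon=True)]
    threads += [threading.Thread(target=worker, args=(i,), daemon=True)
                for i in range(1, nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=30)
    return checks
```

Because each thread flips its private polarity once per iteration, the exit variable never has to be reset between iterations, which is what reduces the exit phase to a single shared write.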

5.2.  Recommendations 

It  is  an  interesting  result  that  the  tree  barriers  show  better  performance  for  the 
case  of  fixed  length  work,  while  the  linear  self-scheduled  barriers  show  improved  per¬ 
formance  for  variable  length  work  and  better  performance  for  fixed  length  work  that 
contains  a  critical  section.  One  consequence  is  that  linear  barriers  are  well  suited  to 
synchronizing  self-scheduled  parallel  loops,  while  tree  barriers  are  better  suited  to  syn¬ 
chronizing  pre-scheduled  homogeneous  loops.  For  finely  tuned  applications,  it  may  be 
desirable  to  tailor  the  barrier  to  the  work  it  synchronizes  in  order  to  achieve  optimal 
performance.  Perhaps  in  the  future,  an  intelligent  compiler  may  be  able  to  make  this 
decision  on  a  case  by  case  basis.  However,  in  the  present  day  for  general  applications 
it  would  seem  easier  to  decide  on  a  single  default  barrier  in  order  to  insulate  the  paral¬ 
lel  programmer  from  this  type  of  decision. 

Before selecting a default barrier for use on a particular machine architecture, it
would be wise to try out several of the algorithms, due to the wide variance and
peculiarity of the shared memory multiprocessors currently available. However, if
general recommendations can be made, then the barriers should be chosen based on
whatever desirable theoretical attributes they possess. For larger values of np (np > 8),
bbrtre, the tree broadcast barrier without locks, is recommended for general
applications due to its logarithmic depth and excellent execution times. For the smaller
values of np (np < 8) and/or applications with many critical sections, bbrlok, the
self-scheduled linear barrier with broadcast exit, is also a good choice. This barrier
always delivers good performance for small values of np, and for larger values of np it
performs well when it is able to exploit the run time conditions associated with
significant process skew. Both of these barriers, coded in an extended Fortran, are
shown in Appendix A.

Appendix A


Fortran code for

Barlin, Barlok, Bartre, Bartrl, Bbrlin, Bbrlok, Bbrtre, Bbrtrl


The following eight barriers, coded in an extended Fortran, are included in this
Appendix. The extended Fortran allows Private and Shared declarations as well as
spinlock and unlock primitives. In all cases, the unlock statement denotes a simple
unconditional unlock routine.


barlin:  linear (pre-sched),   no locks,  symmetric entry & signal
barlok:  linear (self-sched),  locks,     symmetric
bartre:  tree structured,      no locks,  symmetric
bartrl:  tree structured,      locks,     symmetric
bbrlin:  linear (pre-sched),   no locks,  broadcast exit signal
bbrlok:  linear (self-sched),  locks,     broadcast
bbrtre:  tree structured,      no locks,  broadcast
bbrtrl:  tree structured,      locks,     broadcast

These barriers require a single a priori initialization. One process must execute a
call to Sh_init once before any barriers are executed, in order to initialize the shared
variables. Normally the process that forks the other processes can call Sh_init. The
argument to Sh_init should be np, the number of processes participating in the
barrier. Also each process must execute a call to Pr_init once in order to initialize its
private data structures. The arguments to Pr_init are the process id (numbered from
one to np, unique for each process) and np, the number of participating processes.

These barriers are coded using subroutine calls. The process with id = 1 will
always execute the sequential part. A typical expansion of the barriers follows.


C     Begin Barrier
      CALL Bar_entry
      if (id .eq. 1) then

         [< optional sequential code block >]

C     End Barrier
         CALL Bar_signal
      end if

*  barlin:  linear (pre-scheduled), no locks, symmetric structure


      Subroutine Bar_entry
      Shared common /Shbar/ logical LARRAY(20)
      Private common /Prbar/ integer id, np
      if (id .eq. 1) then
C        Process 1 waits for each other process in index order.
         do 10 i = 2,np
 20         if (LARRAY(i) .eq. .true.) goto 20
 10      continue
      else
         LARRAY(id) = .false.
         call Bar_signal
      end if
      return
      end

      Subroutine Bar_signal
      Shared common /Shbar/ logical LARRAY(20)
      Private common /Prbar/ integer id, np
      if (id .eq. 1) then
C        Process 1 releases the other processes one at a time.
         do 10 i = 2,np
            LARRAY(i) = .true.
 10      continue
      else
 20      if (LARRAY(id) .eq. .false.) goto 20
      end if
      return
      end

******************************************************************************

      Subroutine Pr_init(tid,tnp)
      integer tid,tnp
      Private common /Prbar/ integer id, np
      id = tid
      np = tnp
      return
      end

      Subroutine Sh_init(tnp)
      integer tnp
      Shared common /Shbar/ logical LARRAY(20)
      do 10 i = 2,tnp
         LARRAY(i) = .true.
 10   continue
      return
      end

*  barlok:  linear (self-scheduled), locks, symmetric structure

****************************************************************************** 

      Subroutine Bar_entry
      Shared common /Shbar/ logical ENTRY, EXIT
      Shared common /Shbar/ integer COUNTER
      Private common /Prbar/ integer id, np, mycount
      spinlock(ENTRY)
      mycount = COUNTER + 1
      COUNTER = mycount
C     The last process to arrive leaves ENTRY locked until the exit completes.
      if (mycount .lt. np) unlock(ENTRY)
      if (id .eq. 1) then
 10      if (COUNTER .ne. np) goto 10
         return
      else
         spinlock(EXIT)
      end if
      mycount = COUNTER - 1
      COUNTER = mycount
      if (mycount .eq. 0) then
         unlock(ENTRY)
      else
         unlock(EXIT)
      end if
      return
      end

      Subroutine Bar_signal
      Shared common /Shbar/ logical ENTRY, EXIT
      Shared common /Shbar/ integer COUNTER
      Private common /Prbar/ integer id, np, mycount
      COUNTER = COUNTER - 1
      unlock(EXIT)
      return
      end

******************************************************************************

      Subroutine Pr_init(tid,tnp)
      integer tid,tnp
      Private common /Prbar/ integer id, np, mycount
      id = tid
      np = tnp
      return
      end

      Subroutine Sh_init(tnp)
      integer tnp
      Shared common /Shbar/ logical ENTRY, EXIT
      Shared common /Shbar/ integer COUNTER
      COUNTER = 0
      unlock(ENTRY)
      unlock(EXIT)
C     EXIT starts out locked; Bar_signal unlocks it.
      spinlock(EXIT)
      return
      end

*  bartre:  tree structured, no locks, symmetric structure


      Subroutine Bar_entry
      Shared common /Shbar/ logical LARRAY(20)
      Private common /Prbar/ integer id, np, lim
 10   lim = lim/2
 20   if (id .le. lim) then
         if ((id+lim) .gt. np) goto 10
 30      if (LARRAY(id+lim) .eq. .true.) goto 30
         goto 10
      end if
      LARRAY(id) = .false.
      if (id .ne. 1) call Bar_signal
      return
      end

      Subroutine Bar_signal
      Shared common /Shbar/ logical LARRAY(20)
      Private common /Prbar/ integer id, np, lim
      if (id .ne. 1) goto 10
      lim = 1
      goto 30
 10   if (LARRAY(id) .eq. .false.) goto 10
 20   lim = lim * 2
 30   if ((id+lim) .le. np) then
         LARRAY(id+lim) = .true.
         goto 20
      end if
      return
      end

******************************************************************************

      Subroutine Pr_init(tid,tnp)
      integer tid,tnp
      Private common /Prbar/ integer id, np, lim
      id = tid
      np = tnp
C     initialize lim such that:  lim = 2**n >= np > 2**(n-1)
      lim = 1
 10   if (lim .lt. np) then
         lim = lim * 2
         goto 10
      end if
      return
      end

      Subroutine Sh_init(tnp)
      integer tnp
      Shared common /Shbar/ logical LARRAY(20)
      do 10 i = 2,tnp
         LARRAY(i) = .true.
 10   continue
      return
      end

******************************************************************************

*  bartrl:  tree  structured,  locks,  symmetric  structure 

****************************************************************************** 

      Subroutine Bar_entry
      Shared common /Shbar/ logical INARRAY(20), OUTARRAY(20)
      Private common /Prbar/ integer id, np, lim
 10   lim = lim/2
 20   if (id .le. lim) then
         if ((id+lim) .gt. np) goto 10
 30      spinlock(INARRAY(id+lim))
         goto 10
      end if
      unlock(INARRAY(id))
      if (id .ne. 1) CALL Bar_signal
      return
      end

      Subroutine Bar_signal
      Shared common /Shbar/ logical INARRAY(20), OUTARRAY(20)
      Private common /Prbar/ integer id, np, lim
      if (id .ne. 1) goto 10
      lim = 1
      goto 30
 10   spinlock(OUTARRAY(id))
 20   lim = lim*2
 30   if ((id+lim) .le. np) then
         unlock(OUTARRAY(id+lim))
         goto 20
      end if
      return
      end

******************************************************************************

      Subroutine Pr_init(tid,tnp)
      integer tid,tnp
      Private common /Prbar/ integer id, np, lim
      id = tid
      np = tnp
C     initialize lim such that:  lim = 2**n >= np > 2**(n-1)
      lim = 1
 10   if (lim .lt. np) then
         lim = lim * 2
         goto 10
      end if
      return
      end

      Subroutine Sh_init(tnp)
      integer tnp
      Shared common /Shbar/ logical INARRAY(20), OUTARRAY(20)
C     Every lock starts out in the locked state.
      do 10 i = 1,tnp
         unlock(INARRAY(i))
         unlock(OUTARRAY(i))
         spinlock(INARRAY(i))
         spinlock(OUTARRAY(i))
 10   continue
      return
      end

*  bbrlin:  linear (pre-scheduled), no locks, broadcast exit

      Subroutine Bar_entry
      Shared common /Shbar/ logical LARRAY(20)
      Private common /Prbar/ integer id, np, polarity
      if (id .eq. 1) then
         do 10 i = 2,np
 20         if (LARRAY(i) .ne. polarity) goto 20
 10      continue
      else
         LARRAY(id) = polarity
         polarity = .not. polarity
C        Wait for process 1 to broadcast the exit through LARRAY(1).
 30      if (LARRAY(1) .eq. polarity) goto 30
      end if
      return
      end

      Subroutine Bar_signal
      Shared common /Shbar/ logical LARRAY(20)
      Private common /Prbar/ integer id, np, polarity
      LARRAY(1) = polarity
      polarity = .not. polarity
      return
      end

******************************************************************************

      Subroutine Pr_init(tid,tnp)
      integer tid,tnp
      Private common /Prbar/ integer id, np, polarity
      id = tid
      np = tnp
      polarity = .true.
      return
      end

      Subroutine Sh_init(tnp)
      integer tnp
      Shared common /Shbar/ logical LARRAY(20)
      do 10 i = 1,tnp
         LARRAY(i) = .false.
 10   continue
      return
      end

******************************************************************************

*  bbrlok:  linear (self-scheduled), locks, broadcast exit

******************************************************************************

      Subroutine Bar_entry
      Shared common /Shbar/ logical ENTRY, EXIT
      Shared common /Shbar/ integer COUNTER
      Private common /Prbar/ integer id, np, polarity
      if (id .eq. 1) then
C        Process 1 waits until all other processes have checked in.
 10      if (COUNTER .ne. 0) goto 10
      else
         spinlock(ENTRY)
         COUNTER = COUNTER - 1
         unlock(ENTRY)
         polarity = .not. polarity
 20      if (EXIT .eq. polarity) goto 20
      end if
      return
      end

      Subroutine Bar_signal
      Shared common /Shbar/ logical ENTRY, EXIT
      Shared common /Shbar/ integer COUNTER
      Private common /Prbar/ integer id, np, polarity
      COUNTER = np - 1
      EXIT = polarity
      polarity = .not. polarity
      return
      end

******************************************************************************

      Subroutine Pr_init(tid,tnp)
      integer tid,tnp
      Private common /Prbar/ integer id, np, polarity
      id = tid
      np = tnp
      polarity = .true.
      return
      end

      Subroutine Sh_init(tnp)
      integer tnp
      Shared common /Shbar/ logical ENTRY, EXIT
      Shared common /Shbar/ integer COUNTER
      COUNTER = tnp - 1
      EXIT = .false.
      unlock(ENTRY)
      return
      end

*  bbrtre:  tree structured, no locks, broadcast exit


      Subroutine Bar_entry
      Shared common /Shbar/ logical LARRAY(20)
      Private common /Prbar/ integer id, np, lim, polarity
      Private integer ilim, isum
      ilim = lim
      goto 20
 10   ilim = ilim/2
 20   if (id .le. ilim) then
         isum = id + ilim
         if (isum .gt. np) goto 10
 30      if (LARRAY(isum) .ne. polarity) goto 30
         goto 10
      end if
      if (id .ne. 1) then
         LARRAY(id) = polarity
         polarity = .not. polarity
C        Wait for process 1 to broadcast the exit through LARRAY(1).
 40      if (LARRAY(1) .eq. polarity) goto 40
      end if
      return
      end

      Subroutine Bar_signal
      Shared common /Shbar/ logical LARRAY(20)
      Private common /Prbar/ integer id, np, lim, polarity
      LARRAY(1) = polarity
      polarity = .not. polarity
      return
      end

******************************************************************************

      Subroutine Pr_init(tid,tnp)
      integer tid,tnp
      Private common /Prbar/ integer id, np, lim, polarity
      id = tid
      np = tnp
      polarity = .true.
C     initialize lim such that:  lim = 2**n < np <= 2**(n+1)
      lim = 1
 10   if (lim .lt. np) then
         lim = lim * 2
         goto 10
      end if
      lim = lim / 2
      return
      end

      Subroutine Sh_init(tnp)
      integer tnp
      Shared common /Shbar/ logical LARRAY(20)
      do 20 i = 1,tnp
         LARRAY(i) = .false.
 20   continue
      return
      end

*  bbrtrl:  tree structured, locks, broadcast exit


      Subroutine Bar_entry
      Shared common /Shbar/ logical INARRAY(20)
      Private common /Prbar/ integer id, np, lim, polarity
      Private integer ilim, isum
      ilim = lim
      goto 20
 10   ilim = ilim/2
 20   if (id .le. ilim) then
         isum = id + ilim
         if (isum .gt. np) goto 10
 30      spinlock(INARRAY(isum))
         goto 10
      end if
      if (id .ne. 1) then
         unlock(INARRAY(id))
         polarity = .not. polarity
C        INARRAY(1) is used as the broadcast exit variable, not as a lock.
 40      if (INARRAY(1) .eq. polarity) goto 40
      end if
      return
      end

      Subroutine Bar_signal
      Shared common /Shbar/ logical INARRAY(20)
      Private common /Prbar/ integer id, np, lim, polarity
      INARRAY(1) = polarity
      polarity = .not. polarity
      return
      end

******************************************************************************

      Subroutine Pr_init(tid,tnp)
      integer tid,tnp
      Private common /Prbar/ integer id, np, lim, polarity
      id = tid
      np = tnp
      polarity = .true.
C     initialize lim such that:  lim = 2**n < np <= 2**(n+1)
      lim = 1
 10   if (lim .lt. np) then
         lim = lim * 2
         goto 10
      end if
      lim = lim / 2
      return
      end

      Subroutine Sh_init(tnp)
      integer tnp
      Shared common /Shbar/ logical INARRAY(20)
      INARRAY(1) = .false.
C     The remaining locks start out in the locked state.
      do 10 i = 2,tnp
         unlock(INARRAY(i))
         spinlock(INARRAY(i))
 10   continue
      return
      end

Compiler issues: optimized out references to shared data?


Consider  the  following  two  lines  of  Fortran  code.  Statement  pairs  like  this  can 
occur  in  several  of  the  barriers  that  have  been  developed  in  this  paper. 

      flag = .true.

 10   if (flag .eq. .true.) goto 10

If  flag  is  a  shared  variable,  and  another  process  is  expected  to  set  flag  to  false, 
then  we  see  that  this  pair  of  statements  is  perfectly  reasonable.  However,  if  a  conven¬ 
tional  high  performance  optimizing  compiler  got  hold  of  these  two  lines,  then  it  might 
well  optimize  out  the  second  reference  to  flag,  causing  an  infinite  loop. 

What  is  needed  is  a  new  generation  of  compilers  designed  for  parallel  languages. 
Such  compilers  would  be  free  to  fully  optimize  references  to  private  variables,  storing 
them  in  machine  registers,  etc.  But,  compilers  for  parallel  languages  should,  in  gen¬ 
eral,  never  optimize  out  references  to  shared  variables  in  the  code  that  they  produce. 

Since many present compilers do not meet this requirement, it may be necessary
to fool a compiler, so that it will not remove memory references to shared variables.
One simple way to do this is to put one or both of the statements in the above example
into a subroutine. For a language such as Fortran, assuming parameters are passed by
address (and not with copy-restore), this "quick fix" is sufficient to ensure that all
required references to shared variables actually occur. This is the approach that has
been adopted for the algorithms coded in Appendix A. A second alternative is to code
in assembly language. Still another alternative is to thoroughly understand the
compiler to be used, before programming in a compiled high level language.